“Copy/Paste is the mother of learning.”
“Repetition! Repetition is the mother of learning.”
Sources: GitHub | Google Drive | OneDrive
Assumption: Working directory has sub-folders named "data", "images", "code", "docs".
# #R Version
R.version.string
## [1] "R version 4.1.2 (2021-11-01)"# #Working Directory
getwd()
## [1] "D:/Analytics/xADSM"# #Version information about R, the OS and attached or loaded packages
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252 LC_MONETARY=English_India.1252
## [4] LC_NUMERIC=C LC_TIME=English_India.1252
##
## attached base packages:
## [1] compiler grid stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] psych_2.1.9 scales_1.1.1 viridisLite_0.4.0 microbenchmark_1.4.9
## [5] ggpmisc_0.4.4 ggpp_0.4.2 qcc_2.7 VIM_6.1.1
## [9] colorspace_2.0-2 mice_3.13.0 Lahman_9.0-0 gapminder_0.3.0
## [13] nycflights13_1.0.2 gifski_1.4.3-1 data.table_1.14.2 zoo_1.8-9
## [17] car_3.0-12 carData_3.0-4 lubridate_1.8.0 e1071_1.7-9
## [21] latex2exp_0.5.0 readxl_1.3.1 kableExtra_1.3.4 forcats_0.5.1
## [25] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.1.0
## [29] tidyr_1.1.4 tibble_3.1.6 ggplot2_3.3.5 conflicted_1.0.4
##
## loaded via a namespace (and not attached):
## [1] ellipsis_0.3.2 class_7.3-19 rstudioapi_0.13 proxy_0.4-26
## [5] farver_2.1.0 listenv_0.8.0 MatrixModels_0.5-0 prodlim_2019.11.13
## [9] fansi_0.5.0 ranger_0.13.1 xml2_1.3.2 codetools_0.2-18
## [13] splines_4.1.2 mnormt_2.0.2 cachem_1.0.6 robustbase_0.93-9
## [17] knitr_1.36 jsonlite_1.7.2 pROC_1.18.0 caret_6.0-90
## [21] broom_0.7.10 httr_1.4.2 backports_1.3.0 assertthat_0.2.1
## [25] Matrix_1.3-4 fastmap_1.1.0 htmltools_0.5.2 quantreg_5.86
## [29] tools_4.1.2 gtable_0.3.0 glue_1.5.0 reshape2_1.4.4
## [33] Rcpp_1.0.7 cellranger_1.1.0 jquerylib_0.1.4 vctrs_0.3.8
## [37] svglite_2.0.0 nlme_3.1-153 conquer_1.2.1 iterators_1.0.13
## [41] lmtest_0.9-39 timeDate_3043.102 xfun_0.28 gower_0.2.2
## [45] laeken_0.5.2 globals_0.14.0 rvest_1.0.2 lifecycle_1.0.1
## [49] future_1.23.0 DEoptimR_1.0-9 MASS_7.3-54 ipred_0.9-12
## [53] hms_1.1.1 parallel_4.1.2 SparseM_1.81 yaml_2.2.1
## [57] sass_0.4.0 rpart_4.1-15 stringi_1.7.5 foreach_1.5.1
## [61] boot_1.3-28 lava_1.6.10 matrixStats_0.61.0 rlang_0.4.12
## [65] pkgconfig_2.0.3 systemfonts_1.0.3 evaluate_0.14 lattice_0.20-45
## [69] recipes_0.1.17 tidyselect_1.1.1 parallelly_1.29.0 plyr_1.8.6
## [73] magrittr_2.0.1 bookdown_0.24 R6_2.5.1 generics_0.1.1
## [77] DBI_1.1.1 pillar_1.6.4 withr_2.4.2 survival_3.2-13
## [81] abind_1.4-5 sp_1.4-6 nnet_7.3-16 future.apply_1.8.1
## [85] crayon_1.4.2 utf8_1.2.2 tmvnsim_1.0-2 tzdb_0.2.0
## [89] rmarkdown_2.11 ModelMetrics_1.2.2.2 vcd_1.4-9 digest_0.6.28
## [93] webshot_0.5.2 stats4_4.1.2 munsell_0.5.0 bslib_0.3.1
# #Pandoc Version being used by RStudio
rmarkdown::pandoc_version()
## [1] '2.14.0.3'
I wanted a single document containing Notes, Code, & Output as a quick reference for the lectures. A combination of multiple file formats (docx, csv, xlsx, R, png, etc.) was not working out for me, so I used the Bookdown package to generate this HTML file.
All of us had to stumble through some of the most common problems individually, and as we approach deeper topics, a more collaborative approach might be more beneficial.
Further, the lectures are highly focused, so I had to explore some side topics in more detail to get the most benefit from them. I have included those topics, and I am interested in hearing about your experiences too.
Towards that goal, I am sharing these notes, hoping that you will run the code in your own environment and raise any queries, problems, or differences in outcomes. Any suggestion or criticism is welcome. I have tried not to produce any significant changes in your working environment; please let me know if you observe otherwise.
Currently, my priority is to get in sync with the ongoing lectures. The time constraint has led to the issues given below; these will be corrected as and when possible.
Last, but not least, I am also learning while creating this, so if you think I am wrong somewhere, please point it out. I am always open to suggestions.
Thank You all for the encouragement.
Shivam
R is case-sensitive, i.e. c() not C(), and View() not view().
Hash sign “#” comments out everything after it, until the newline. There are no multiline comments.
Backslash “\” is reserved to escape the character that follows it.
The Escape key interrupts the parser, e.g. at the continuation prompt “+”, where R is waiting for more input before evaluation.
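A minimal sketch of these basics (object names are illustrative):

```r
# #R is case-sensitive: c() combines values; C() is an unrelated function (sets contrasts)
x <- c(1, 2)    #A "#" comments out the rest of the line; there are no multiline comments
#
# #Backslash escapes the character that follows it inside strings
cat("a\tb\n")   #"\t" is a tab, "\n" is a newline
nchar("\t")     #An escape sequence counts as a single character
## [1] 1
```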
Overview
Execute the current expression in the Source Pane (Top) with the ‘Run’ button or “Ctrl + Enter”.
Execute the current expression in the Console Pane (Bottom) with “Enter”.
Windows 10 uses backslash “\” in PATHs. R, however, uses forward slash “/”, because backslash “\” is the escape character in R.
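A minimal sketch of the two path styles on Windows (the paths are hypothetical):

```r
# #Forward slashes work in R on Windows; backslashes must be doubled (escaped)
path_fwd <- "C:/Users/me/data"     #Preferred
path_bwd <- "C:\\Users\\me\\data"  #Escaped backslashes
identical(path_fwd, gsub("\\\\", "/", path_bwd))
## [1] TRUE
```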
In R Studio, Set Working Directory by:
# #Current Working Directory
getwd()
## [1] "D:/Analytics/xADSM"
#
# #R Installation Directory (Old DOS Convention i.e. ~1 after 6 letters)
R.home()
## [1] "C:/PROGRA~1/R/R-41~1.2"
Sys.getenv("R_HOME")
## [1] "C:/PROGRA~1/R/R-41~1.2"
#
# #This is Wrapped in IF Block to prevent accidental execution
if(FALSE){
# #WARNING: This will change your Working Directory
setwd("~")
}
If an R program is entered at the console, line by line, then the output is printed automatically, i.e. no function is needed for printing. This is called implicit printing.
Inside an R Script File, implicit printing does not work and the expression needs to be printed explicitly.
In R, the most common method to print the output ‘explicitly’ is by the function print().
# #Implicit Printing: This will NOT be printed to Console, if it is inside an R Script.
"Hello World!"
#
# #Implicit Printing using '()': Same as above
("Hello World!")
#
# #Explicit Printing using print() : To print Objects to Console, even inside an R Script.
print("Hello World!")
## [1] "Hello World!"Everything that exists in R is an object in the sense that it is a kind of data structure that can be manipulated. Expressions for evaluation are themselves objects; Evaluation consists of taking the object representing an expression and returning the object that is the value of that expression.
# #ls(): List ALL Objects in the Current NameSpace (Environment)
ls()
## character(0)
Caution: Always use “<-” for assignment, NOT “=”.
While “=” can be used for assignment, its use for assignment is highly discouraged because it may behave differently under certain subtle conditions that are difficult to debug. The convention is to use “=” only in function calls, for argument association (as a syntactic token).
There are 5 assignment operators (<-, =, <<-, ->, ->>), others are not going to be discussed for now.
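A minimal sketch of one such subtlety (assuming nothing named ‘zz’ exists yet): inside a function call, “=” matches an argument, while “<-” performs a real assignment as a side effect.

```r
# #'=' here only names the argument 'x'; no object is created
median(x = 1:10)
## [1] 5.5
# #'<-' here assigns 1:10 to 'zz' in the calling environment as a side effect
median(zz <- 1:10)
## [1] 5.5
exists("zz")
## [1] TRUE
```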
All the created objects are listed in the Environment Tab of the Top Right Pane.
# #Assignment Operator "<-" is used to assign any value (ex: 10) to any object (ex: 'bb')
bb <- 10
#
# #Print Object
print(bb)
## [1] 10
In the Environment Tab, any object can be selected and deleted using the Brush icon.
# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] 10
#
# #Remove an Object
rm(bb)
#
# #Equivalent
if(FALSE) {rm("bb")} #Same
if(FALSE) {rm(list = "bb")} #Faster, verbose, and would not work without quotes
#
# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] "Error in print(bb): object 'bb' not found\n"9.1 Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.
9.2 Elements are the entities on which data are collected. (Generally ROWS)
9.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)
9.4 The set of measurements obtained for a particular element is called an observation.
9.5 Statistics is the art and science of collecting, analysing, presenting, and interpreting data.
R has 6 basic data types (logical, integer, double, character, complex, and raw). These data types can be combined to form Data Structures (vector, list, matrix, dataframe, factor etc.). Refer What is a Vector!
Atomic vectors are homogeneous i.e. each component has the same datatype. A vector type can be checked with the typeof() or class() function. Its length, i.e. the number of elements in the vector, can be checked with the function length().
If the output of an expression does not show numbers in brackets like ‘[1]’, then the return value is NULL (or was returned invisibly). The bracketed [numbers] show that the result is a vector, with the number giving the index of the first element on that line. E.g. str() and cat() return NULL invisibly.
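A minimal sketch: cat() prints its text as a side effect but returns NULL invisibly, so its result carries no ‘[1]’ index.

```r
# #cat() prints as a side effect and returns NULL invisibly
res <- cat("Hello World!\n")
## Hello World!
is.null(res)
## [1] TRUE
```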
Use function c() to create a vector (or a list) -
NULL < raw < logical < integer < double < complex < character < list < expression
Caution: Colon “:” might produce a vector of unexpected length (e.g. “1:n” counts down when n is 0, giving a length-2 vector instead of a 0-length one). Suggestion: Use the colon only with hardcoded numbers, i.e. “1:10” is OK; “1:n” is dangerous and should be avoided.
Caution: seq() function might produce unexpected type of vectors (in case of 1-length vectors). Suggestion: Use seq_along(), seq_len().
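A minimal sketch of the 0-length pitfall and the safer alternatives:

```r
n <- 0L
1:n                      #Counts DOWN: length 2, not 0
## [1] 1 0
seq_len(n)               #Safe: a 0-length integer vector
## integer(0)
seq_along(character(0))  #Safe: indices along an (empty) vector
## integer(0)
```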
# #To know about an Object: str(), class(), length(), dim(), typeof(), is(), attributes(), names()
# #Integer: To declare as integer "L" (NOT "l") is needed
ii_int <- c(1L, 2L, 3L, 4L, 5L)
str(ii_int)
## int [1:5] 1 2 3 4 5
#
# #Double (& Default)
dd_dbl <- c(1, 2, 3, 4, 5)
str(dd_dbl)
## num [1:5] 1 2 3 4 5
#
# #Character
cc_chr <- c('a', 'b', 'c', 'd', 'e')
str(cc_chr)
## chr [1:5] "a" "b" "c" "d" "e"
#
# #Logical
ll_lgl <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
str(ll_lgl)
## logi [1:5] TRUE FALSE FALSE TRUE TRUE
# #Integer Vector of Length 1
nn <- 5L
#
# #Colon ":" Operator - Avoid its usage
str(c(1:nn))
## int [1:5] 1 2 3 4 5
c(typeof(pi:6), typeof(6:pi))
## [1] "double" "integer"
#
# #seq() - Avoid its usage
str(seq(1, nn))
## int [1:5] 1 2 3 4 5
str(seq(1, nn, 1))
## num [1:5] 1 2 3 4 5
str(seq(1, nn, 1L))
## num [1:5] 1 2 3 4 5
str(seq(1L, nn, 1L))
## int [1:5] 1 2 3 4 5
#
# #seq_len()
str(seq_len(nn))
## int [1:5] 1 2 3 4 5
str(seq(1, 5, 1))
## num [1:5] 1 2 3 4 5
str(letters[1:5])
## chr [1:5] "a" "b" "c" "d" "e"
str(1:5 %% 2 == 0)
## logi [1:5] FALSE TRUE FALSE TRUE FALSE
# #Create Two Vectors
income <- c(100, 200, 300, 400, 500)
gender <- c("male", "female", "female", "female", "male")
#
# #Create a DataFrame
bb <- data.frame(income, gender)
#
# #Print or View DataFrame
#View(bb)
print(bb)
## income gender
## 1 100 male
## 2 200 female
## 3 300 female
## 4 400 female
## 5 500 male
#
# #Structure
str(bb)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
#
# #Names
names(bb)
## [1] "income" "gender"R Script file extension is “.R”
"Ctrl+ S" will Open Save Window at Working Directory.
"Ctrl+ O" will Open the Browse Window at Working Directory.
# #Subdirectory "data" has data files like .csv .rds .txt .xlsx
# #Subdirectory "code" has scripts files like .R
# #Subdirectory "images" has images like .png
#
# #Check if a File exists
path_relative <- "data/aa.xlsx" #Relative Path
#
if(file.exists(path_relative)) {
cat("File Exists\n")
} else {
cat(paste0("File does not exist at ", getwd(), "/", path_relative, "\n"))
}
## File Exists
#
if(exists("XL", envir = .z)) {
cat(paste0("Absolute Path exists as: ", .z$XL, "\n"))
path_absolute <- paste0(.z$XL, "aa", ".xlsx") #Absolute Path
#
if(file.exists(path_absolute)) {
cat("File Exists\n")
} else {
cat(paste0("File does not exist at ", path_absolute, "\n"))
}
} else {
cat(paste0("Object 'XL' inside Hidden Environment '.z' does not exist. \n",
"It is probably File Path of the Author, Replace the File Path from Your own Directory\n"))
}
## Absolute Path exists as: D:/Analytics/xADSM/data/
## File Exists
The write.csv() and read.csv() combination can be used to export data and import it back into R. But it has some limitations, e.g. row names are written as an extra column and column classes are not preserved on re-import, as shown below -
xx_data <- bb
#
# #Write a dataframe to a CSV File
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from the CSV into a dataframe
yy_data <- read.csv("data/B09_xx_data.csv")
#
# #Check if the object being read is the same as the object that was written
identical(xx_data, yy_data)
## [1] FALSE
# #Exercise to show how to match the objects being imported/exported from CSV
str(bb)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
xx_data <- bb
# #Write to CSV
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from CSV by providing row.names Column and colClasses()
yy_data <- read.csv("data/B09_xx_data.csv", row.names = 1,
colClasses = c('character', 'numeric', 'character'))
#
# #Coerce row.names attribute to integer
attr(yy_data, "row.names") <- as.integer(attr(yy_data, "row.names"))
#
# #Check if the objects are identical
identical(xx_data, yy_data)
## [1] TRUE
stopifnot(identical(xx_data, yy_data))
str(bb)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
xx_data <- bb
#
# #Save the Object as RDS File
saveRDS(xx_data, "data/B09_xx_data.rds")
#
# #Read from the RDS File
yy_data <- readRDS("data/B09_xx_data.rds")
#
# #Objects are identical (No additional transformations are needed)
identical(xx_data, yy_data)
## [1] TRUE
str(xx_data)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
# #Adding a Column to a dataframe
xx_data <- data.frame(xx_data, age = 22:26)
#
# #Adding a Column to a dataframe by adding a Vector
x_age <- 22:26
xx_data <- data.frame(xx_data, x_age)
str(xx_data)
## 'data.frame': 5 obs. of 4 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
## $ age : int 22 23 24 25 26
## $ x_age : int 22 23 24 25 26
#
# #Adding a Column to a dataframe by using dollar "$"
xx_data$age1 <- x_age
#
# #Adding a Blank Column using NA
xx_data$blank <- NA
#
# #Editing of a dataframe can also be done
# edit(xx_data)
str(xx_data)
## 'data.frame': 5 obs. of 6 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
## $ age : int 22 23 24 25 26
## $ x_age : int 22 23 24 25 26
## $ age1 : int 22 23 24 25 26
## $ blank : logi NA NA NA NA NA
#
# #Removing a Column by subsetting
xx_data <- xx_data[ , -c(3)]
#
# #Removing a Column using NULL
xx_data$age1 <- NULL
str(xx_data)
## 'data.frame': 5 obs. of 4 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
## $ x_age : int 22 23 24 25 26
## $ blank : logi NA NA NA NA NA
Packages include reusable functions, the documentation that describes how to use them, and sample data.
In R Studio: Packages Tab | Install | Package Name = “psych” | Install
if(FALSE){
# #WARNING: This will install packages and R Studio will NOT work for that duration
# #Install Packages and their dependencies
install.packages("psych", dependencies = TRUE)
}
# #Load a Package with or without Quotes
library(readxl)
library("readr")# #Load Multiple Packages
pkg_chr <- c("ggplot2", "tibble", "tidyr", "readr", "dplyr")
#lapply(pkg_chr, FUN = function(x) {library(x, character.only = TRUE)})
#
# #Load Multiple Packages, Suppress Startup Messages, and No console output
invisible(lapply(pkg_chr, FUN = function(x) {
suppressMessages(library(x, character.only = TRUE))}))
# #Detach a package
#detach("package:psych", unload = TRUE)
#
# #Search Package in the already loaded packages
pkg_chr <- "psych"
if (pkg_chr %in% .packages()) {
# #Detach a package that has been loaded previously
detach(paste0("package:", pkg_chr), character.only = TRUE, unload = TRUE)
}
To Import Excel in RStudio: Environment | Dropdown | From Excel | Browse
The object imported by read.csv() (i.e. ‘mydata’) is NOT the same as the one imported by read_excel() (i.e. ‘mydata_xl’).
All of these objects can be converted into any other form as needed, e.g. dataframe to tibble or vice-versa.
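A minimal sketch of such a conversion (object names are illustrative; the tibble package is assumed to be installed):

```r
# #Convert a dataframe to a tibble and back
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
tb <- tibble::as_tibble(df)  #dataframe -> tibble
class(tb)
## [1] "tbl_df"     "tbl"        "data.frame"
df2 <- as.data.frame(tb)     #tibble -> dataframe
class(df2)
## [1] "data.frame"
```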
# #Data File Name has been modified to include lecture number "B09"
# #All Data Files are in the sub-directory named 'data'
mydata <- read.csv("data/B09-FLIGHTS.csv")
#
# #To Copy from Clipboard, assuming copied from xlsx i.e. tab separated data
mydata_clip <- read.csv("clipboard", sep = '\t', header = TRUE)
# #Following Setup allows us to read CSV only once and then create an RDS file
# #Its advantage is in terms of faster loading time and lower memory requirement
xx_csv <- paste0("data/", "B09-FLIGHTS", ".csv")
xx_rds <- paste0("data/", "b09_flights", ".rds")
b09_flights <- NULL
if(file.exists(xx_rds)) {
b09_flights <- readRDS(xx_rds)
} else {
# #Read CSV
b09_flights <- read.csv(xx_csv)
# #Write Object as RDS
saveRDS(b09_flights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata <- b09_flights
str(mydata)
## 'data.frame': 336776 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : int 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : int 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : int 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : int 1400 1416 1089 1576 762 719 1065 229 944 733 ...
## $ hour : int 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : int 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : chr "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
head(mydata)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
## 1 2013 1 1 517 515 2 830 819 11 UA 1545
## 2 2013 1 1 533 529 4 850 830 20 UA 1714
## 3 2013 1 1 542 540 2 923 850 33 AA 1141
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725
## 5 2013 1 1 554 600 -6 812 837 -25 DL 461
## 6 2013 1 1 554 558 -4 740 728 12 UA 1696
## tailnum origin dest air_time distance hour minute time_hour
## 1 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
## 2 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
## 3 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
## 4 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
## 5 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
## 6 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
tail(mydata)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## 336771 2013 9 30 NA 1842 NA NA 2019 NA EV
## 336772 2013 9 30 NA 1455 NA NA 1634 NA 9E
## 336773 2013 9 30 NA 2200 NA NA 2312 NA 9E
## 336774 2013 9 30 NA 1210 NA NA 1330 NA MQ
## 336775 2013 9 30 NA 1159 NA NA 1344 NA MQ
## 336776 2013 9 30 NA 840 NA NA 1020 NA MQ
## flight tailnum origin dest air_time distance hour minute time_hour
## 336771 5274 N740EV LGA BNA NA 764 18 42 2013-09-30 18:00:00
## 336772 3393 <NA> JFK DCA NA 213 14 55 2013-09-30 14:00:00
## 336773 3525 <NA> LGA SYR NA 198 22 0 2013-09-30 22:00:00
## 336774 3461 N535MQ LGA BNA NA 764 12 10 2013-09-30 12:00:00
## 336775 3572 N511MQ LGA CLE NA 419 11 59 2013-09-30 11:00:00
## 336776 3531 N839MQ LGA RDU NA 431 8 40 2013-09-30 08:00:00
# #library(readxl)
mydata_xl <- read_excel("data/B09-FLIGHTS.xlsx", sheet = "FLIGHTS")
# #library(readxl)
xx_xl <- paste0("data/", "B09-FLIGHTS", ".xlsx")
xx_rds_xl <- paste0("data/", "b09_flights_xls", ".rds")
b09_flights_xls <- NULL
if(file.exists(xx_rds_xl)) {
b09_flights_xls <- readRDS(xx_rds_xl)
} else {
b09_flights_xls <- read_excel(xx_xl, sheet = "FLIGHTS")
saveRDS(b09_flights_xls, xx_rds_xl)
}
rm(xx_xl, xx_rds_xl)
mydata_xl <- b09_flights_xls
#str(mydata_xl)
## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
## $ year : num [1:336776] 2013 2013 2013 2013 2013 ...
## $ month : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : chr [1:336776] "517" "533" "542" "544" ...
## $ sched_dep_time: num [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : chr [1:336776] "2" "4" "2" "-1" ...
## $ arr_time : chr [1:336776] "830" "850" "923" "1004" ...
## $ sched_arr_time: num [1:336776] 819 830 850 1022 837 ...
## $ arr_delay : chr [1:336776] "11" "20" "33" "-18" ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : num [1:336776] 1545 1714 1141 725 461 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : chr [1:336776] "227" "227" "160" "183" ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# #Following Setup allows us to read CSV only once and then create an RDS file
# #Its advantage is in terms of faster loading time and lower memory requirement
# #library(readr)
xx_csv <- paste0("data/", "B09-FLIGHTS", ".csv")
xx_rds <- paste0("data/", "xxflights", ".rds")
xxflights <- NULL
if(file.exists(xx_rds)) {
xxflights <- readRDS(xx_rds)
} else {
xxflights <- read_csv(xx_csv, show_col_types = FALSE)
attr(xxflights, "spec") <- NULL
attr(xxflights, "problems") <- NULL
saveRDS(xxflights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata_rdr <- xxflights
# #Subset All Rows and last 3 columns
data6 <- mydata[ , c(17:19)]
str(data6)
## 'data.frame': 336776 obs. of 3 variables:
## $ hour : int 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : int 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour: chr "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# #Subset by deleting the 1:16 columns
data7 <- mydata[ , -c(1:16)]
stopifnot(identical(data6, data7))
Caution: Attaching a Dataset should be avoided, to prevent unexpected behaviour due to ‘masking’. Using full scope resolution, i.e. ‘data_frame$column_header’, results in fewer bugs. However, if a Dataset has been attached, please ensure that it is also detached.
if(FALSE){
# #WARNING: Attaching a Dataset is discouraged because of 'masking'
# #'dep_time' is Column Header of a dataframe 'mydata'
tryCatch(str(dep_time), error = function(e) print(paste0(e)))
## [1] "Error in str(dep_time): object 'dep_time' not found\n"
# #Attach the Dataset
attach(mydata)
# #Now all the column headers are accessible without the $ sign
str(dep_time)
## int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
# #But, there are other datasets also, attaching another one results in MESSAGE
attach(mydata_xl)
## The following objects are masked from mydata:
##
## air_time, arr_delay, arr_time, carrier, day, dep_delay, dep_time, dest,
## distance, flight, hour, minute, month, origin, sched_arr_time,
## sched_dep_time, tailnum, time_hour, year
str(dep_time)
## chr [1:336776] "517" "533" "542" "544" "554" "554" "555" "557" "557" "558" "558" ...
#
# #'mydata_xl$dep_time' masked the already present 'mydata$dep_time'.
# #Thus now it is showing as 'chr' in place of original 'int'
# #Column Header Names can be highly varied and those will silently mask other variable
# #Hence, attaching a dataset would result in random bugs or unexpected behaviours
#
# #Detach a Dataset
detach(mydata_xl)
detach(mydata)
}
Figure 1.1 Correlation using psych::pairs.panels()
# #Subset 3 Columns and 100,000 rows
x_rows <- 100000L
data_pairs <- mydata[1:x_rows, c(7, 16, 9)]
#
# #Equivalent
data_pairs <- mydata %>%
select(air_time, distance, arr_delay) %>%
slice_head(n = x_rows)
#
if( nrow(data_pairs) * ncol(data_pairs) > 1000000 ) {
print("Please reduce the number of points to a sane number!")
ggplot()
} else {
#B09P01
pairs.panels(data_pairs)
}
R Markdown documents allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more.
An R Markdown document is written in Markdown (an easy-to-write plain-text format) and contains chunks of embedded R code. To know more, go to RStudio.
To know more about Google Colab, go to Google Colab.
NOTE: As I am not using Google Colab, the workflow explained between 00:00 to 35:10 is NOT covered here. If someone is using Google Colab, and is willing to share their notes, I would include those.
Base R graphs/plots are shown in Figure 2.1.
Figure 2.1 Flights: Arrival Time (Y) vs. Departure (X) Time
# #Create a Subset of Dataframe of 1000 Rows for quick calculations
bb <- head(mydata, 1000)
#
# #Dimensions: dim() Row x Column; nrow(); ncol()
dim(bb)
## [1] 1000 19
#
stopifnot(identical(nrow(bb), dim(bb)[1]))
stopifnot(identical(ncol(bb), dim(bb)[2]))
# #Split a Dataframe by subsetting
data_1 <- bb[ , 1:8]
data_2 <- bb[ , 9:19]
# str(data_1)
# #Merge a Dataframe by cbind()
data_3 <- cbind(data_1, data_2)
# #Equivalent
data_4 <- data.frame(data_1, data_2)
# str(data_3)
stopifnot(identical(data_3, data_4))
# #Row Split
data_5 <- bb[1:300, ]
data_6 <- bb[301:1000, ]
#
# #Equivalent
n_rows <- 300L
data_5 <- bb[1:n_rows, ]
data_6 <- bb[(n_rows + 1L):nrow(bb), ]
#
stopifnot(identical(data_5, head(bb, n_rows)))
stopifnot(identical(data_6, tail(bb, (nrow(bb) - n_rows))))
# #Merge a Dataframe by rbind()
data_7 <- rbind(data_5, data_6)
stopifnot(identical(bb, data_7))
# #Change A Specific Name based on Index Ex: First Header "year" -> "YEAR"
# #NOTE: Output of 'names(bb)' is a character vector, not a dataframe
# #So, [1] is being used to subset for 1st element and NOT the [ , 1] (as done for dataframe)
(names(bb)[1] <- "YEAR")
## [1] "YEAR"
#
# #Change all Column Headers to Uppercase by toupper() or Lowercase by tolower()
names(bb) <- toupper(names(bb))
NA can be coerced to any other vector type except raw. There are also typed constants like NA_integer_, NA_real_, NA_character_, etc. For checking only the presence of NA, anyNA() is faster than is.na().
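A minimal sketch of the typed NA constants and NA coercion:

```r
# #NA is logical by default; the typed constants keep a vector's type stable
typeof(NA)
## [1] "logical"
typeof(NA_integer_)
## [1] "integer"
typeof(NA_real_)
## [1] "double"
# #NA coerces to the surrounding vector type
str(c(1L, NA))
## int [1:2] 1 NA
str(c("a", NA))
## chr [1:2] "a" NA
```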
Overview of ‘Not Available’
To remove all NA
bb <- xxflights
# #anyNA() is faster than is.na()
if(anyNA(bb)) print("NA are Present!") else print("NA not found")
## [1] "NA are Present!"
#
# #Columnwise NA Count
bb_na_col <- colSums(is.na(bb))
#
# #Vector of Columns having NA
which(bb_na_col != 0)
## dep_time dep_delay arr_time arr_delay tailnum air_time
## 4 6 7 9 12 15
stopifnot(identical(which(bb_na_col != 0), which(vapply(bb, anyNA, logical(1)))))
#
# #Indices of Rows with NA
head(which(!complete.cases(bb)))
## [1] 472 478 616 644 726 734
#
# #How many rows contain NA
sum(!complete.cases(bb))
## [1] 9430
#
# #How many rows have NA in specific Columns
sum(!complete.cases(bb[, c(6, 9, 4)]))
## [1] 9430
# #Remove all rows which have any NA
# #na.omit(), complete.cases(), tidyr::drop_na(), rowSums(is.na())
bb_1 <- na.omit(bb)
# #Print the Count of removed rows containing NA
print(paste0("Note: ", length(attributes(bb_1)$na.action), " rows removed."))
## [1] "Note: 9430 rows removed."
#
# #Remove additional Attribute added by na.omit()
attr(bb_1, "na.action") <- NULL
#
# #Equivalent
bb_2 <- bb[complete.cases(bb), ]
bb_3 <- bb %>% drop_na()
bb_4 <- bb[rowSums(is.na(bb)) == 0, ]
#Validation
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))
#
# #complete.cases also allow partial selection of specific columns
# #Remove rows which have NA in some columns i.e. ignore NA in other columns
dim(bb[complete.cases(bb[ , c(6, 9, 4)]), ])
## [1] 327346 19
# #Equivalent
dim(bb %>% drop_na(dep_delay, arr_delay, dep_time))
## [1] 327346 19
#
# #Remove rows which have more than allowed number of NA (ex:4) in any column
# #Caution: In general, this is not recommended because random columns retain NA
dim(bb[rowSums(is.na(bb)) <= 4L, ])
## [1] 328521 19
Sources: Grouping Functions and the Apply Family, Why is vapply safer than sapply, Hadley - Advanced R - Functionals, This, This, & This
The apply family of functions in R is designed to avoid explicit use of loop constructs.
# #Subset Dataframe
bb <- xxflights
data_8 <- bb[ , c("dep_delay", "arr_delay", "dep_time")]
#data_8 <- bb %>% select(dep_delay, arr_delay, dep_time)
#
# #Remove missing values
data_9 <- na.omit(data_8)
#
# #Calculate Columnwise Mean
(bb_1 <- apply(data_9, 2, mean))
## dep_delay arr_delay dep_time
## 12.555156 6.895377 1348.789883
bb_2 <- unlist(lapply(data_9, mean))
bb_3 <- sapply(data_9, mean)
bb_4 <- vapply(data_9, mean, numeric(1))
#
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))
Refer The 6 Datatypes of Atomic Vectors
Create a Basic Tibble, Table 2.1, for evaluating the ‘is.x()’ series of functions in Base R
| ii | dd | cc | ll | ff | fo | dtm | dat |
|---|---|---|---|---|---|---|---|
| 1 | 1 | a | FALSE | odd | odd | 2021-12-30 18:07:04 | 2021-12-31 |
| 2 | 2 | b | TRUE | even | even | 2021-12-30 18:07:05 | 2022-01-01 |
| 3 | 3 | c | FALSE | odd | odd | 2021-12-30 18:07:06 | 2022-01-02 |
| 4 | 4 | d | TRUE | even | even | 2021-12-30 18:07:07 | 2022-01-03 |
| 5 | 5 | e | FALSE | odd | odd | 2021-12-30 18:07:08 | 2022-01-04 |
| 6 | 6 | f | TRUE | even | even | 2021-12-30 18:07:09 | 2022-01-05 |
# #Validation
# #anyNA() is TRUE if there is an NA present, FALSE otherwise
vapply(bb, anyNA, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.atomic() is TRUE for All Atomic Vectors, factor, matrix but NOT for list
vapply(bb, is.atomic, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#
# #is.vector() is TRUE for All Atomic Vectors, list but NOT for factor, matrix, DATE & POSIXct
# #CAUTION: With vapply() it returns TRUE for matrix (it checks individual elements)
# #CAUTION: FALSE if the vector has attributes (except names) ex: DATE & POSIXct
vapply(bb, is.vector, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
#
# #is.numeric() is TRUE for both integer and double
vapply(bb, is.numeric, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.integer() is TRUE only for integer
vapply(bb, is.integer, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.double() is TRUE only for double
vapply(bb, is.double, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
#
# #is.character() is TRUE only for character
vapply(bb, is.character, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#
# #is.logical() is TRUE only for logical
vapply(bb, is.logical, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE# #Factors
# #is.factor() is TRUE only for factor (unordered or ordered)
vapply(bb, is.factor, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
#
# #is.ordered() is TRUE only for ordered factor
vapply(bb, is.ordered, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#
# #nlevels()
vapply(bb, nlevels, integer(1))
## ii dd cc ll ff fo dtm dat
## 0 0 0 0 2 2 0 0
#
# #levels()
vapply(bb, function(x) !is.null(levels(x)), logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
#
# #table()
table(bb$ff)
##
## even odd
## 3 3
# #Package lubridate covers the missing functions for POSIXct, POSIXlt, or Date
# #is.timepoint() is TRUE for POSIXct, POSIXlt, or Date
vapply(bb, is.timepoint, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
#
# #is.POSIXt() is TRUE for POSIXct and POSIXlt, but NOT for Date
vapply(bb, is.POSIXt, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#
# #is.Date() is TRUE only for Date
vapply(bb, is.Date, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# #Which Columns have Duplicate Values
vapply(bb, function(x) anyDuplicated(x) != 0L, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
Factors can store both strings and integers. They are useful for columns that take a limited number of unique values, such as “Male, Female” or “True, False”, and they are widely used in data analysis for statistical modelling.
Under the hood, a factor is an integer vector of level codes together with a levels attribute that maps each code to a label.
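This can be verified directly; a minimal sketch with a throwaway vector shows the integer storage and the levels attribute:

```r
# #A factor is integer level codes plus a 'levels' attribute (illustrative)
f <- factor(c("b", "a", "b"))
typeof(f)        # "integer"
levels(f)        # "a" "b" (levels are sorted alphabetically by default)
unclass(f)       # codes 2 1 2, with attr "levels"
```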
as.factor() vs. factor()
str(bb$ll)
## logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
# #Coercion to Factor
bb$new <- as.factor(bb$ll)
str(bb$new)
## Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
#
# #table()
table(bb$ll)
##
## FALSE TRUE
## 3 3
table(bb$new)
##
## FALSE TRUE
## 3 3
#
# #Levels can be Labelled differently also
str(bb$ff)
## Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
# #
str(factor(bb$ff, levels = c("even", "odd"), labels = c("day", "night")))
## Factor w/ 2 levels "day","night": 2 1 2 1 2 1
str(factor(bb$ff, levels = c("odd", "even"), labels = c("day", "night")))
## Factor w/ 2 levels "day","night": 1 2 1 2 1 2
#
# #Coercion from Factor to character, logical etc.
bb$xcc <- as.character(bb$new)
bb$xll <- as.logical(bb$new)
#
str(bb)
## tibble [6 x 11] (S3: tbl_df/tbl/data.frame)
## $ ii : int [1:6] 1 2 3 4 5 6
## $ dd : num [1:6] 1 2 3 4 5 6
## $ cc : chr [1:6] "a" "b" "c" "d" ...
## $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
## $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
## $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
## $ dtm: POSIXct[1:6], format: "2021-12-30 18:07:04" "2021-12-30 18:07:05" "2021-12-30 18:07:06" ...
## $ dat: Date[1:6], format: "2021-12-31" "2022-01-01" "2022-01-02" ...
## $ new: Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
## $ xcc: chr [1:6] "FALSE" "TRUE" "FALSE" "TRUE" ...
## $ xll: logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
bb <- xxflights
aa <- c("month", "day")
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
## $ month: num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
# #To factor
bb$day <- as.factor(bb$day)
bb$month <- as.factor(bb$month)
# #Equivalent
#bb[aa] <- lapply(bb[aa], as.factor)
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
## $ month: Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
Caution: The only thing you need to take care of is that you do not give two elements the same name. R will NOT throw an ERROR.
Due to the resulting two-dimensional structure, data frames can mimic some of the behaviour of matrices. You can select rows and do operations on rows. You cannot do that with lists, as a row is undefined there.
A Dataframe is intended to be used as a relational table. This means that elements in the same column are related to each other in the sense that they are all measures of the same metric. And, elements in the same row are related to each other in the sense that they are all measures from the same observation or measures of the same item. This is why when you look at the structure of a Dataframe, it will state the number of observations and the number of variables instead of the number of rows and columns.
Dataframes are distinct from Matrices because they can include heterogeneous data types among columns/variables. Dataframes do not permit multiple data types within a column/variable, for reasons that also follow from the relational table idea.
All this implies that you should use a data frame for any dataset that fits in that two-dimensional structure. Essentially, you use data frames for any dataset where a column coincides with a variable and a row coincides with a single observation in the broad sense of the word. For all other structures, lists are the way to go.
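The contrast above can be demonstrated in a minimal sketch with throwaway objects: row subsetting works on a data frame, while a plain list has no notion of a row:

```r
# #Data frames support row operations; plain lists do not (illustrative)
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
df[2, ]                      # one observation across all variables
lst <- list(a = 1:3, b = c("x", "y", "z"))
tryCatch(lst[2, ], error = function(e) cat("a list has no rows\n"))
```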
# #CAUTION: Do not Create a list with duplicate names (R will NOT throw ERROR)
bb <- list(a=1, b=2, a=3)
# # 3rd index cannot be accessed using $
bb$a
## [1] 1
identical(bb$a, bb[[1]])
## [1] TRUE
identical(bb$a, bb[[3]])
## [1] FALSE
bb[[3]]
## [1] 3
# #Create a list
bb_lst <- list( a = c(1, 2), b = c('a', 'b', 'c'))
tryCatch(
# #Trying to create varying length of variables in dataframe like in list
bb_dft <- data.frame(a = c(1, 2), b = c('a', 'b', 'c')),
error = function(e) {
# #Print ERROR
cat(paste0(e))
# #Double Arrow Assignment '<<-' to assign in parent environment
bb_dft <<- data.frame(a = c(1, 2), b = c('a', 'b'))
}
)
## Error in data.frame(a = c(1, 2), b = c("a", "b", "c")): arguments imply differing number of rows: 2, 3
#
# #Both list and dataframe have same type()
typeof(bb_lst)
## [1] "list"
typeof(bb_dft)
## [1] "list"
#
# #But, class() is different for list and dataframe
class(bb_lst)
## [1] "list"
class(bb_dft)
## [1] "data.frame"
#
str(bb_lst)
## List of 2
## $ a: num [1:2] 1 2
## $ b: chr [1:3] "a" "b" "c"
str(bb_dft)
## 'data.frame': 2 obs. of 2 variables:
## $ a: num 1 2
## $ b: chr "a" "b"
#
# #Although 'bb_lst_c' is a list, coercion still happens inside each atomic vector, i.e. 9 becomes character
bb_lst_c <- list( a = c(8, 'x'), b = c('y', 9))
str(bb_lst_c[[2]][2])
## chr "9"
#
# #Here, 9 stays numeric because it is its own list element; note the extra [[ ]]
bb_lst_l <- list( a = list(8, 'x'), b = list('y', 9))
str(bb_lst_l[[2]][[2]])
## num 9
# #Create a Matrix
bb_mat <- matrix(1:6, nrow = 2, ncol = 3)
print(bb_mat)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
str(bb_mat)
## int [1:2, 1:3] 1 2 3 4 5 6
class(bb_mat)
## [1] "matrix" "array"
typeof(bb_mat)
## [1] "integer"
# #Basic Tibble
bb <- xxbasic10
str(bb)
## tibble [6 x 8] (S3: tbl_df/tbl/data.frame)
## $ ii : int [1:6] 1 2 3 4 5 6
## $ dd : num [1:6] 1 2 3 4 5 6
## $ cc : chr [1:6] "a" "b" "c" "d" ...
## $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
## $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
## $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
## $ dtm: POSIXct[1:6], format: "2021-12-30 18:07:04" "2021-12-30 18:07:05" "2021-12-30 18:07:06" ...
## $ dat: Date[1:6], format: "2021-12-31" "2022-01-01" "2022-01-02" ...
# #Split with 'cc' as common ID column
bb_a <- bb[1:3]
bb_b <- bb[3:ncol(bb)]
#
# #merge() using the common ID column 'cc'
bb_new <- merge(bb_a, bb_b, by = "cc")
bb_new
## cc ii dd ll ff fo dtm dat
## 1 a 1 1 FALSE odd odd 2021-12-30 18:07:04 2021-12-31
## 2 b 2 2 TRUE even even 2021-12-30 18:07:05 2022-01-01
## 3 c 3 3 FALSE odd odd 2021-12-30 18:07:06 2022-01-02
## 4 d 4 4 TRUE even even 2021-12-30 18:07:07 2022-01-03
## 5 e 5 5 FALSE odd odd 2021-12-30 18:07:08 2022-01-04
## 6 f 6 6 TRUE even even 2021-12-30 18:07:09 2022-01-05
bb <- xxflights
# #Sort ascending (default)
bb_1 <- bb[order(bb$dep_delay), ]
# #Sort descending
bb_2 <- bb[order(-bb$dep_delay), ]
#
bb[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
## dep_time dep_delay tailnum carrier
## <dbl> <dbl> <chr> <chr>
## 1 517 2 N14228 UA
## 2 533 4 N24211 UA
## 3 542 2 N619AA AA
## 4 544 -1 N804JB B6
## 5 554 -6 N668DN DL
bb_1[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
## dep_time dep_delay tailnum carrier
## <dbl> <dbl> <chr> <chr>
## 1 2040 -43 N592JB B6
## 2 2022 -33 N612DL DL
## 3 1408 -32 N825AS EV
## 4 1900 -30 N934DL DL
## 5 1703 -27 N208FR F9
bb_2[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
## dep_time dep_delay tailnum carrier
## <dbl> <dbl> <chr> <chr>
## 1 641 1301 N384HA HA
## 2 1432 1137 N504MQ MQ
## 3 1121 1126 N517MQ MQ
## 4 1139 1014 N338AA AA
## 5 845 1005 N665MQ MQ
bb <- xxbasic10
bb
## # A tibble: 6 x 8
## ii dd cc ll ff fo dtm dat
## <int> <dbl> <chr> <lgl> <fct> <ord> <dttm> <date>
## 1 1 1 a FALSE odd odd 2021-12-30 18:07:04 2021-12-31
## 2 2 2 b TRUE even even 2021-12-30 18:07:05 2022-01-01
## 3 3 3 c FALSE odd odd 2021-12-30 18:07:06 2022-01-02
## 4 4 4 d TRUE even even 2021-12-30 18:07:07 2022-01-03
## 5 5 5 e FALSE odd odd 2021-12-30 18:07:08 2022-01-04
## 6 6 6 f TRUE even even 2021-12-30 18:07:09 2022-01-05
# #Sort ascending (default)
(bb_1 <- bb[order(bb$ll), ])
## # A tibble: 6 x 8
## ii dd cc ll ff fo dtm dat
## <int> <dbl> <chr> <lgl> <fct> <ord> <dttm> <date>
## 1 1 1 a FALSE odd odd 2021-12-30 18:07:04 2021-12-31
## 2 3 3 c FALSE odd odd 2021-12-30 18:07:06 2022-01-02
## 3 5 5 e FALSE odd odd 2021-12-30 18:07:08 2022-01-04
## 4 2 2 b TRUE even even 2021-12-30 18:07:05 2022-01-01
## 5 4 4 d TRUE even even 2021-12-30 18:07:07 2022-01-03
## 6 6 6 f TRUE even even 2021-12-30 18:07:09 2022-01-05
# #Sort on Multiple Columns with ascending and descending
(bb_2 <- bb[order(bb$ll, -bb$dd), ])
## # A tibble: 6 x 8
## ii dd cc ll ff fo dtm dat
## <int> <dbl> <chr> <lgl> <fct> <ord> <dttm> <date>
## 1 5 5 e FALSE odd odd 2021-12-30 18:07:08 2022-01-04
## 2 3 3 c FALSE odd odd 2021-12-30 18:07:06 2022-01-02
## 3 1 1 a FALSE odd odd 2021-12-30 18:07:04 2021-12-31
## 4 6 6 f TRUE even even 2021-12-30 18:07:09 2022-01-05
## 5 4 4 d TRUE even even 2021-12-30 18:07:07 2022-01-03
## 6 2 2 b TRUE even even 2021-12-30 18:07:05 2022-01-01
#
stopifnot(identical(bb_2, arrange(bb, ll, -dd)))
“Understanding basic data manipulation using R”
Links (Ref)
# #To get the Help files on any Topic including 'loaded' Packages
?dplyr
?mutate
# #To get the Help files on any Topic including functions from 'not loaded' Packages
?dplyr::mutate()
# #Operators need Backticks i.e. ` . On most keyboards it is located below the 'Esc' key
?`:`
Overview
A logical value can be TRUE, FALSE, or NA.
# #At least one TRUE is present
NA | TRUE
## [1] TRUE
# #Depending upon what the unknown is, the outcome will change
NA | FALSE
## [1] NA
# #Depending upon what the unknown is, the outcome will change
NA & TRUE
## [1] NA
# #At least one FALSE is present
NA & FALSE
## [1] FALSE
#
# #For length-1 vectors, the output of the vectorised and non-vectorised forms is the same
stopifnot(all(identical(NA || TRUE, NA | TRUE), identical(NA || FALSE, NA | FALSE),
identical(NA && TRUE, NA & TRUE), identical(NA && FALSE, NA & FALSE)))
#
# #But for vectors of length >1, the output differs
x <- 1:5
y <- 5:1
(x > 2) & (y < 3)
## [1] FALSE FALSE FALSE TRUE TRUE
(x > 2) && (y < 3)
## [1] FALSE
#
# # '&&' evaluates only the first element of a vector (R >= 4.3.0 throws an error instead), so caution is advised
TRUE & c(TRUE, FALSE)
## [1] TRUE FALSE
TRUE & c(FALSE, FALSE)
## [1] FALSE FALSE
TRUE && c(TRUE, FALSE)
## [1] TRUE
TRUE && c(FALSE, FALSE)
## [1] FALSE
TRUE && all(c(TRUE, FALSE))
## [1] FALSE
TRUE && any(c(TRUE, FALSE))
## [1] TRUE
if(exists("x")) rm(x)
exists("x")
## [1] FALSE
#
# # No short-circuit for "|" or "&", Evaluates Right and throws Error
tryCatch( TRUE | x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
tryCatch( FALSE & x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
#
# #Does not evaluate Right input because outcome already determined
tryCatch( TRUE || x, error = function(e) cat(paste0(e)))
## [1] TRUE
tryCatch( FALSE && x, error = function(e) cat(paste0(e)))
## [1] FALSE
# #evaluates Right input because outcome cannot be determined and throws error
tryCatch( TRUE && x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
# #any()
any(NA, TRUE)
## [1] TRUE
any(NA, FALSE)
## [1] NA
any(NA, TRUE, na.rm = TRUE)
## [1] TRUE
any(NA, FALSE, na.rm = TRUE)
## [1] FALSE
any(character(0))
## [1] FALSE
#
# #all()
all(NA, TRUE)
## [1] NA
all(NA, FALSE)
## [1] FALSE
all(NA, TRUE, na.rm = TRUE)
## [1] TRUE
all(NA, FALSE, na.rm = TRUE)
## [1] FALSE
all(character(0))
## [1] TRUE
\(>\) , \(<\) , \(==\) , \(>=\) , \(<=\) , \(!=\)
# #dplyr::filter() - Filter Rows based on Multiple Columns
bb_1 <- filter(bb, month == 1, day == 1)
dim(bb_1)
## [1] 842 19
# #Filtering by multiple criteria within a single logical expression
stopifnot(identical(bb_1, filter(bb, month == 1 & day == 1)))
#
if(anyNA(bb_1)) {
bb_na <- na.omit(bb_1)
print(paste0("Note: ", length(attributes(bb_na)$na.action), " rows removed."))
} else {
print("NA not found")
}
## [1] "Note: 11 rows removed."
dim(bb_na)
## [1] 831 19
dim(bb)
## [1] 336776 19
#
# #Flights in either November or December
dim(bb_2 <- filter(bb, month == 11 | month == 12))
## [1] 55403 19
#
# #Flights with arrival delay '<= 120' or departure delay '<= 120'
# #It excludes flights where arrival & departure BOTH are delayed by >2 hours
# #If either delay is less than 2 hours, the flight is included
dim(bb_3 <- filter(bb, arr_delay <= 120 | dep_delay <= 120))
## [1] 320060 19
dim(bb_4 <- filter(bb, !(arr_delay > 120 & dep_delay > 120)))
## [1] 320060 19
dim(bb_5 <- filter(bb, (!arr_delay > 120 | !dep_delay > 120)))
## [1] 320060 19
#
# #Destination to IAH or HOU
dim(bb_6 <- filter(bb, dest == "IAH" | dest == "HOU"))
## [1] 9313 19
dim(bb_7 <- filter(bb, dest %in% c("IAH", "HOU")))
## [1] 9313 19
#
# #Carrier being "UA", "US", "DL"
dim(bb_8 <- filter(bb, carrier == "UA" | carrier == "US" | carrier == "DL"))
## [1] 127311 19
dim(bb_9 <- filter(bb, carrier %in% c("UA", "US", "DL")))
## [1] 127311 19
#
# #Did not leave late (before or on-time departure) but Arrived late by >2 hours
dim(bb_10 <- filter(bb, (arr_delay > 120) & !(dep_delay > 0)))
## [1] 29 19
#
# #Departed between midnight and 6 AM (inclusive)
dim(bb_11 <- filter(bb, (sched_dep_time >= 00 & sched_dep_time <= 600)))
## [1] 8970 19
# #subset() - Recommendation is against its usage. Use either '[]' or filter()
dim(bb_12 <- subset(bb, month == 1 | !(dep_delay >= 120),
select = c("flight", "arr_delay")))
## [1] 319760 2
dim(bb_13 <- subset(bb, month == 1 | !(dep_delay >= 120) | carrier == "DL",
select = c("flight", "arr_delay")))
## [1] 321139 2
\([ \ \ ]\) , \([[ \ \ ]]\) , \(\$\)
dplyr::select()
dim(bb)
## [1] 336776 19
#
# #Subset Consecutive Columns using Colon
stopifnot(identical(bb[ , 2:5], bb[ , -c(1, 6:ncol(bb))]))
#
# #dplyr::select()
bb_14 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_15 <- bb %>% select(year:day, arr_delay, dep_delay, distance, air_time)
stopifnot(identical(bb_14, bb_15))
dim(bb)
## [1] 336776 19
#
# #Subset Rows
dim(bb[which(bb$day == 1 & !(bb$month ==1)), ])
## [1] 10194 19
dim(bb[which(bb$day == 1 | bb$month ==1), ])
## [1] 37198 19
dim(bb[which(bb$day == 1 & bb$month ==1), ])
## [1] 842 19
# #CAUTION: the comma makes the second condition the 'arr.ind' argument of which(), so it is silently ignored
dim(bb[which(bb$day == 1, bb$month ==1), ])
## [1] 11036 19
dim(bb[which(bb$day == 1 & !(bb$carrier == "DL")), ])
## [1] 9482 19
dim(bb[which(bb$day == 1 | bb$carrier == "DL"), ])
## [1] 57592 19
dim(bb[which(bb$day == 1 & bb$carrier == "DL"), ])
## [1] 1554 19
# #CAUTION: second condition is again passed as 'arr.ind' and silently ignored
dim(bb[which(bb$day == 1, bb$carrier == "DL"), ])
## [1] 11036 19
bb <- xxflights
# #dplyr::summarise() & dplyr::summarize() are same
# #Get the mean of a column with NA excluded
#
summarize(bb, delay_mean = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay_mean
## <dbl>
## 1 12.6
#
# #base::summary()
summary(bb$dep_delay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -43.00 -5.00 -2.00 12.64 11.00 1301.00 8255
#
# #Grouped Summary
by_ymd <- group_by(bb, year, month, day)
mysum <- summarize(by_ymd,
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE),
.groups = "keep")
# #Equivalent
bb %>%
group_by(year, month, day) %>%
summarize(dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE),
.groups= "keep")
## # A tibble: 365 x 5
## # Groups: year, month, day [365]
## year month day dep_delay_mean arr_delay_mean
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 11.5 12.7
## 2 2013 1 2 13.9 12.7
## 3 2013 1 3 11.0 5.73
## 4 2013 1 4 8.95 -1.93
## 5 2013 1 5 5.73 -1.53
## 6 2013 1 6 7.15 4.24
## 7 2013 1 7 5.42 -4.95
## 8 2013 1 8 2.55 -3.23
## 9 2013 1 9 2.28 -0.264
## 10 2013 1 10 2.84 -5.90
## # ... with 355 more rows
# #Get delay grouped by distance 'Distance between airports, in miles.'
summary(bb$distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17 502 872 1040 1389 4983
#
# #How many unique values are present in this numeric data (i.e. treated as factor levels)
str(as.factor(bb$distance))
## Factor w/ 214 levels "17","80","94",..: 163 165 145 171 106 96 138 22 120 99 ...
str(sort(unique(bb$distance)))
## num [1:214] 17 80 94 96 116 143 160 169 173 184 ...
bb %>%
group_by(distance) %>%
summarize(count = n(),
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE),
.groups= "keep")
## # A tibble: 214 x 4
## # Groups: distance [214]
## distance count dep_delay_mean arr_delay_mean
## <dbl> <int> <dbl> <dbl>
## 1 17 1 NaN NaN
## 2 80 49 18.9 16.5
## 3 94 976 17.5 12.7
## 4 96 607 3.19 5.78
## 5 116 443 17.7 7.05
## 6 143 439 23.6 14.4
## 7 160 376 21.8 16.2
## 8 169 545 18.5 15.1
## 9 173 221 7.05 -0.286
## 10 184 5504 3.07 0.123
## # ... with 204 more rows
#
# #For distance == 17 there is only 1 flight, and its delays are NA, so the mean is NaN
bb[bb$distance == 17, ]
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2013 7 27 NA 106 NA NA 245 NA US
## # ... with 9 more variables: flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#
# #In general, every flight to a given destination (ex: ABQ) has travelled the same distance (1826)
unique(bb %>% filter(dest == "ABQ") %>% select(distance))
## # A tibble: 1 x 1
## distance
## <dbl>
## 1 1826
#
# #Mean Delays for Destinations with more than 1000 miles distance
bb %>%
group_by(dest) %>%
filter(distance > 1000) %>%
summarize(count = n(),
distance_mean = mean(distance, na.rm = TRUE),
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE))
## # A tibble: 48 x 5
## dest count distance_mean dep_delay_mean arr_delay_mean
## <chr> <int> <dbl> <dbl> <dbl>
## 1 ABQ 254 1826 13.7 4.38
## 2 ANC 8 3370 12.9 -2.5
## 3 AUS 2439 1514. 13.0 6.02
## 4 BQN 896 1579. 12.4 8.25
## 5 BUR 371 2465 13.5 8.18
## 6 BZN 36 1882 11.5 7.6
## 7 DEN 7266 1615. 15.2 8.61
## 8 DFW 8738 1383. 8.68 0.322
## 9 DSM 569 1021. 26.2 19.0
## 10 EGE 213 1736. 15.5 6.30
## # ... with 38 more rows
dim(bb)
## [1] 336776 19
#
bb_16 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_17 <- mutate(bb_16,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours)
# #Equivalent
bb %>%
select(year:day, arr_delay, dep_delay, distance, air_time) %>%
mutate(gain = arr_delay - dep_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 11
## year month day arr_delay dep_delay distance air_time gain speed hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 11 2 1400 227 9 370. 3.78 2.38
## 2 2013 1 1 20 4 1416 227 16 374. 3.78 4.23
## 3 2013 1 1 33 2 1089 160 31 408. 2.67 11.6
## 4 2013 1 1 -18 -1 1576 183 -17 517. 3.05 -5.57
## 5 2013 1 1 -25 -6 762 116 -19 394. 1.93 -9.83
## 6 2013 1 1 12 -4 719 150 16 288. 2.5 6.4
## 7 2013 1 1 19 -5 1065 158 24 404. 2.63 9.11
## 8 2013 1 1 -14 -3 229 53 -11 259. 0.883 -12.5
## 9 2013 1 1 -8 -3 944 140 -5 405. 2.33 -2.14
## 10 2013 1 1 8 -2 733 138 10 319. 2.3 4.35
## # ... with 336,766 more rows
9.20 A population is the set of all elements of interest in a particular study.
9.23 The process of conducting a survey to collect data for the entire population is called a census.
9.21 A sample is a subset of the population.
15.7 A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.
9.25 Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.
Inferential statistics are used for Hypothesis Testing. Refer Statistical Inference
Refer Hypothesis Testing
17.1 Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.
17.2 Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.
17.3 Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.
Refer Steps of Hypothesis Testing
17.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)
17.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
17.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)
Question: Is there an ideal sample size
15.9 To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.
15.10 A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), \(\rho_{xy}\) (correlation).
15.11 The numerical value obtained for the sample statistic is called the point estimate. The term estimate applies to sample values only; the corresponding population value is called a parameter. An estimate is a value, while an estimator is a function.
Example: \({\overline{x}}\) is an estimator (of the population parameter ‘mean’ \({\mu}\)). A computed value such as 3 is its estimate, and the calculation process is an estimation.
11.6 Given a data set \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({{x}_1, {x}_2, \ldots, {x}_n}\) divided by the count \({n}\).
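In symbols, consistent with the definition above:

```latex
\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
```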
Refer Standard Deviation and equation (11.12)
11.12 The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.
\[\begin{align} \sigma &= \sqrt{\frac{1}{N} \sum_{i=1}^N \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(x_i - \overline{x}\right)^2} \end{align}\]
A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
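A minimal illustration with made-up numbers: the two vectors below share a mean of 50, but their spreads differ sharply:

```r
# #Same mean, very different standard deviations (illustrative data)
a <- c(49, 50, 51)
b <- c(10, 50, 90)
mean(a); mean(b)   # both 50
sd(a); sd(b)       # 1 vs 40
```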
Refer Variance and equation (11.11)
11.11 The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean \({\overline{x}}\). The average of the squared deviations is called the variance.
\[\begin{align} \sigma^2 &= \frac{1}{N} \sum _{i=1}^{N} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align}\]
Variability is most commonly measured with the Range, IQR, SD, and Variance.
The sample we draw from the population is only one from a large number of potential samples.
Refer Standard Error
15.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’
15.14 A sampling error is the difference between a population parameter and a sample statistic.
Sampling fluctuation (Standard Error) refers to the extent to which a statistic (mean, median, mode, sd etc.) takes on different values with different samples i.e. it refers to how much the value of the statistic fluctuates from sample to sample.
15.12 The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).
Standard Deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\) is given by equation (15.1) i.e. \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\)
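Equation (15.1) can be checked empirically by repeated sampling. A simulation sketch with an assumed synthetic population (the names, sizes, and seed are arbitrary):

```r
# #sd of many sample means should approximate sigma / sqrt(n) (simulation sketch)
set.seed(1)
pop <- rnorm(1e5, mean = 50, sd = 10)   # hypothetical population, sigma = 10
n <- 25
means <- replicate(2000, mean(sample(pop, n)))
sd(means)        # close to the theoretical standard error
10 / sqrt(n)     # 2
```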
Refer Test Statistic
17.11 Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.
17.14 If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (17.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)
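Equation (17.1) as a one-liner, with assumed illustrative numbers (not taken from the Flights data):

```r
# #z test statistic with known sigma (illustrative values)
xbar <- 162.4; mu0 <- 160; sigma <- 8; n <- 25
z <- (xbar - mu0) / (sigma / sqrt(n))
z                              # ~1.5
p_two_tail <- 2 * pnorm(-abs(z))
```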
Standard Error (SE) is same as ‘the standard deviation of the sampling distribution.’ The ‘variance of the sampling distribution’ is the Variance of the data divided by N.
# #DataSet: Height of 5 people in 'cm'
hh <- c(170.5, 161, 160, 170, 150.5)
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #Mean by mean()
hh_mean <- mean(hh)
cat("Mean = ", hh_mean)
## Mean = 162.4
#
# #Variance by var()
hh_var <- round(var(hh), 3)
cat("Variance = ", hh_var)
## Variance = 68.175
#
# #Standard Deviation (SD) by sd()
hh_sd <- round(sd(hh), 3)
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) = 8.257
#
# #Standard Error (SE)
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) = 3.693
# #DataSet: Height of 5 people in 'cm'
print(hh)
## [1] 170.5 161.0 160.0 170.0 150.5
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #sum by sum()
print(hh_sum <- sum(hh))
## [1] 812
#
# #Mean by mean()
hh_mean <- mean(hh)
hh_mean_cal <- hh_sum / hh_len
stopifnot(identical(hh_mean, hh_mean_cal))
cat("Mean = ", hh_mean)
## Mean = 162.4
#
# #Calculate the deviation from the mean by subtracting each value from the mean
print(hh_dev <- hh - hh_mean)
## [1] 8.1 -1.4 -2.4 7.6 -11.9
#
# #Square the deviation
print(hh_sqdev <- hh_dev^2)
## [1] 65.61 1.96 5.76 57.76 141.61
#
# #Get Sum of the squared deviations
print(hh_sqdev_sum <- sum(hh_sqdev))
## [1] 272.7
#
# #Divide it by the 'sample size (N) – 1' for the Variance or use var()
hh_var <- round(var(hh), 3)
hh_var_cal <- hh_sqdev_sum / (hh_len -1)
stopifnot(identical(hh_var, hh_var_cal))
cat("Variance = ", hh_var)
## Variance = 68.175
#
# #Variance of the sampling distribution
hh_var_sample <- hh_var / hh_len
cat("Variance of the Sampling Distribution = ", hh_var_sample)
## Variance of the Sampling Distribution = 13.635
#
# #Take square root of the Variance for the Standard Deviation (SD) or use sd()
hh_sd_cal <- round(sqrt(hh_var), 3)
hh_sd <- sd(hh)
stopifnot(identical(round(hh_sd, 3), hh_sd_cal))
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) = 8.256815
#
# #Standard Error (SE)
# #SE
# #Divide the SD by the square root of the sample size for the Standard Error (SE)
# #
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
#
# #Calculate SE from Variance
hh_se_var <- round(sqrt(hh_var_sample), 3)
stopifnot(identical(hh_se_sd, hh_se_var))
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) = 3.693
Using Dataset Flights : “air_time” - Amount of time spent in the air, in minutes. Refer figure 4.1
Figure 4.1 Flights: Air Time (min) excluding NA (Histogram and Density)
# #Remove All NA
aa <- na.omit(xxflights$air_time)
attr(aa, "na.action") <- NULL
str(aa)
## num [1:327346] 227 227 160 183 116 150 158 53 140 138 ...
summary(aa)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 82.0 129.0 150.7 192.0 695.0
# #Overview of Data after removal of NA
bb <- aa
stopifnot(is.null(dim(bb)))
summary(bb)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 82.0 129.0 150.7 192.0 695.0
# #min(), max(), range(), summary()
min_bb <- summary(bb)[1]
max_bb <- summary(bb)[6]
range_bb <- max_bb - min_bb
cat(paste0("Range = ", range_bb, " (", min_bb, ", ", max_bb, ")\n"))
## Range = 675 (20, 695)
# #IQR(), summary()
iqr_bb <- IQR(bb)
cat(paste0("IQR = ", iqr_bb, " (", summary(bb)[2], ", ", summary(bb)[5], ")\n"))
## IQR = 110 (82, 192)
# #median(), mean(), summary()[3], summary()[4]
median_bb <- median(bb)
cat("Median =", median_bb, "\n")
## Median = 129
mu_mean_bb <- mean(bb)
cat("Mean \u03bc =", mu_mean_bb, "\n")
## Mean µ = 150.6865
#
sigma_sd_bb <- sd(bb)
cat("SD (sigma) \u03c3 =", sigma_sd_bb, "\n")
## SD (sigma) σ = 93.6883
#
variance_bb <- var(bb)
cat(sprintf('Variance (sigma)%s %s%s =', '\u00b2', '\u03c3', '\u00b2'), variance_bb, "\n")
## Variance (sigma)² σ² = 8777.498
# #Histogram
bb <- na.omit(xxflights$air_time)
hh <- tibble(ee = bb)
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B12P01 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
geom_text(data = tibble(x = mean_hh, y = -Inf,
label = paste0("Mean= ", mean_hh)),
aes(x = x, y = y, label = label),
color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
geom_text(data = tibble(x = median_hh, y = -Inf,
label = paste0("Median= ", median_hh)),
aes(x = x, y = y, label = label),
color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh,
"; Median= ", median_hh, "; SD= ", sd_hh,
")"),
caption = "B12P01", title = "Flights: Air Time")
}
# #Density Curve
# #Get Quantiles and Ranges of mean +/- sigma
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'),
TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'),
TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
B12P02 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_density(alpha = 0.2, colour = "#21908CFF") +
geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') +
geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') +
#scale_y_continuous(limits = c(0, 0.009), breaks = seq(0, 0.009, 0.003)) +
scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) +
ggplot2::annotate("segment", x = xpoints_hh[4] - 0.5 * sd_hh, xend = xpoints_hh[2], y = 0.007,
yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate("segment", x = xpoints_hh[4] + 0.5 * sd_hh, xend = xpoints_hh[6], y = 0.007,
yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.007, label = "95.4%") +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Density",
subtitle = paste0("(N=", nrow(.), "; ", "Mean= ", round(mean(.[[1]]), 1),
"; Median= ", round(median(.[[1]]), 1), "; SD= ", round(sd(.[[1]]), 1),
")"),
caption = "B12P02", title = "Flights: Air Time")
}
Using Dataset flights: "air_time" is the amount of time spent in the air, in minutes.
Caution: The trend here does not match the theory. However, the exercise demonstrates the 'how to do it' part; it can be repeated with better data, a larger sample size, or repeated sampling.
Figure 4.2 Effect of Increasing Sample Size
Figure 4.3 Effect of Increasing Sample Size
bb <- na.omit(xxflights$air_time)
# #Pseudo Random Number Generation by set.seed()
set.seed(3)
# #Set Sample Size
#nn <- 100L
# #Take a sample from dataset
xb100 <- sample(bb, size = 100L)
xb1000 <- sample(bb, size = 1000L)
xb10000 <- sample(bb, size = 10000L)
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 100
hh <- tibble(ee = xb100)
ylim_hh <- 12.5
caption_hh <- "B12P03"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, caption_hh, ylim_hh
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"),
x = mean(.data[["ee"]]), y = -Inf),
color = '#440154FF', hjust = 2, vjust = -2.5, parse = TRUE, check_overlap = TRUE) +
geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) +
coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(Mean= ", round(mean(.[[1]]), 1),
"; SD= ", round(sd(.[[1]]), 1),
#"; Var= ", round(var(.[[1]]), 1),
"; SE= ", round(sd(.[[1]]) / sqrt(nrow(.)), 1),
")"),
caption = caption_hh, title = paste0("Sample Size = ", nrow(.)))
}
assign(caption_hh, B12)
rm(B12)
Figure 4.4 Effect of Increasing Sample Size
Figure 4.5 Effect of Increasing Sampling
bb <- na.omit(xxflights$air_time)
# #Pseudo Random Number Generation by set.seed()
set.seed(3)
# #Set Sample Size
nn <- 10L
# #Set Repeat Sampling Rate
rr <- 20L
# #Take Sample of N = 10, get mean, repeat i.e. get distribution of mean
xr20 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 200L
xr200 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 2000L
xr2000 <- replicate(rr, mean(sample(bb, size = nn)))
#
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 10, Repeat = 20
hh <- tibble(ee = xr20)
ylim_hh <- 2
caption_hh <- "B12P06"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, caption_hh, ylim_hh, nn
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
geom_text(aes(label = TeX(r'($E(\bar{x})$)', output = "character"),
x = mean(.data[["ee"]]), y = -Inf),
color = '#440154FF', hjust = 1.5, vjust = -1.5, parse = TRUE, check_overlap = TRUE) +
geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) +
coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) +
theme(plot.title.position = "panel") +
labs(x = TeX(r'($\bar{x} \, (\neq x)$)'), y = TeX(r'(Frequency of $\, \bar{x}$)'),
subtitle = TeX(sprintf(
"($\\mu$=%.0f) $E(\\bar{x}) \\, =$%.0f; $\\sigma_{\\bar{x}} \\, =$%.0f",
mu_hh, round(mean(.[[1]]), 1), round(sd(.[[1]])))),
caption = caption_hh,
title = paste0("Sampling Distribution (N = ", nn, ") & Repeat Sampling = ", nrow(.)))
}
assign(caption_hh, B12)
rm(B12)
Refer Normal Distribution and equation (14.2)
14.3 A normal distribution (\({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\)) is a type of continuous probability distribution for a real-valued random variable.
The importance of normal distributions is partly due to the Central Limit Theorem. Assuming a normal distribution allows us to apply Parametric Methods.
26.1 Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).
15.15 Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.
It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases.
Parametric statistical tests typically assume that samples come from normally distributed populations, but the central limit theorem means that this assumption need not be met when you have a large enough sample. A sample size of 30 or more is generally considered large.
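The convergence described above can be sketched with a small simulation. The Exponential(1) population, seed, and sizes below are illustrative assumptions, not part of the case data:

```r
# #CLT sketch: means of samples from a skewed population look approximately normal
set.seed(1)
# #Exponential(1) population: skewed, mean = 1, sd = 1 (assumed for illustration)
xbar30 <- replicate(2000, mean(rexp(30, rate = 1)))
mean(xbar30)   # close to the population mean 1
sd(xbar30)     # close to 1 / sqrt(30), i.e. roughly 0.18
# #hist(xbar30) would show an approximately bell-shaped distribution
```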
This is the basis of the Empirical Rule.
11.20 The empirical rule is used to compute the percentage of data values that lie within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These percentages are 68.27%, 95.45%, and 99.73%, respectively.
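These percentages can be verified directly from the standard normal CDF:

```r
# #Empirical rule percentages from the standard normal distribution
diff(pnorm(c(-1, 1)))   # 0.6826895, i.e. 68.27%
diff(pnorm(c(-2, 2)))   # 0.9544997, i.e. 95.45%
diff(pnorm(c(-3, 3)))   # 0.9973002, i.e. 99.73%
```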
Caution: If data from small samples do not closely follow this pattern, then other distributions like the t-distribution may be more appropriate.
Refer Standard Normal and equation (14.3)
14.4 A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\)
The simplest case of a normal distribution is known as the standard normal distribution. Given a population with normal distribution \({\mathcal{N}}_{(\mu, \, \sigma)}\): if \(\overline{X}\) is the mean of a sample of size \({n}\) from this population, then the standard error is \(\sigma/{\sqrt{n}}\) and thus the z-score is \(Z=\frac {\overline{X} - \mu}{\sigma/{\sqrt{n}}}\)
The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.
For an upper-tail test, the area under the curve to the right of the z-score is the p-value: the likelihood of your observation occurring if the null hypothesis is true.
Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.
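A minimal sketch of these two steps; the numbers (\({\mu}_0 = 100\), \({\sigma} = 15\), \(n = 36\), \(\overline{x} = 104\)) are assumed for illustration:

```r
# #Hypothetical numbers: mu0 = 100, sigma = 15, n = 36, xbar = 104
z <- (104 - 100) / (15 / sqrt(36))   # z = 1.6
pnorm(z, lower.tail = FALSE)         # upper-tail p-value ~ 0.055 > 0.05: not significant
```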
Refer Outliers: C03
11.21 Unusually large or unusually small values are sometimes called outliers. An outlier is a data point that differs significantly from other observations.
Figure 4.6 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors
Example
Since we are using sample data to make inferences about the population, it is possible that we will make an error. In the case of the Null Hypothesis, we can make one of two errors.
Refer Type I and Type II Errors
17.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).
17.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).
17.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.
16.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.
17.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
There is always a tradeoff between Type-I and Type-II errors.
In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.
17.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.
Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement "do not reject \({H_0}\)" instead of “accept \({H_0}\).”
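Although significance tests fix only \(\alpha\), the Type II error \(\beta\) (and hence power) can be computed for an assumed effect size with `power.t.test()` from base R's stats package. The numbers below are illustrative assumptions:

```r
# #Power of a one-sample t-test for an assumed effect of 0.5 SD with n = 30
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")
# #The 'power' entry in the printed output is 1 - beta for these settings
```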
Figure 4.7 Left Tail vs. Right Tail
Figure 4.8 Two Tail
17.18 Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
## 10% 5% 5/2% 1% 1/2%
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
## 10% 5% 5/2% 1% 1/2%
## 1.2816 1.6449 1.9600 2.3263 2.5758
17.16 A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).
17.19 An acceptance region (confidence interval) is a set of values of the test statistic for which the null hypothesis is accepted, i.e., if the observed test statistic falls in the confidence interval, then we accept the null hypothesis and reject the alternative hypothesis.
17.21 A rejection region (critical region) is a set of values of the test statistic for which the null hypothesis is rejected, i.e., if the observed test statistic falls in the critical region, then we reject the null hypothesis and accept the alternative hypothesis.
17.12 A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.
One-tailed tests are concerned with one side of a statistic, whereas two-tailed tests deal with both tails of the distribution.
A two-tailed test is done when you do not know the direction of the effect, so you test both sides.
17.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)
17.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
17.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)
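For the same observed test statistic, the three formulations above give different p-values; the two-tail p-value is twice the corresponding one-tail value. Using z = 1.99 as an illustration:

```r
# #One-tail vs two-tail p-values for the same observed z
z <- 1.99
pnorm(z, lower.tail = FALSE)   # upper-tail p ~ 0.0233
pnorm(z, lower.tail = TRUE)    # lower-tail p ~ 0.9767
2 * pnorm(-abs(z))             # two-tail p ~ 0.0466
```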
17.15 The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.
Steps for the p-value approach or test statistic approach
17.17 The critical value approach requires that we first determine a value for the test statistic called the critical value.
Steps for the critical value approach
If the population standard deviation \((\sigma)\) is known, apply the z-test. If it is unknown, apply the t-test. The t-test converges to the z-test as the sample size increases.
Question: Does the probability from the t-table differ from the probability value from the z-table?
It is assumed that \((\overline{x} - \mu)\) follows a normal distribution. However, the estimated Standard Error (SE) does not: the sample variance (suitably scaled) follows a chi-squared distribution. Thus \((\overline{x} - \mu)/SE\) is a ratio of a normal variable to (the square root of) a chi-squared variable, and this ratio follows the t-distribution. Hence the test we apply is called a t-test.
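The convergence of the t-test to the z-test can be seen directly in the critical values: as the degrees of freedom grow, `qt()` approaches `qnorm()`:

```r
# #Upper 5% critical values: t shrinks toward z as degrees of freedom increase
round(qt(p = 0.05, df = c(5, 10, 30, 100, 1000), lower.tail = FALSE), 4)
round(qnorm(p = 0.05, lower.tail = FALSE), 4)   # 1.6449
```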
# #For Degrees of Freedom = 10 (N=11)
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
dof <- 10L
#
# #Left Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = TRUE), 4)
## 10% 5% 5/2% 1% 1/2%
## -1.3722 -1.8125 -2.2281 -2.7638 -3.1693
#
# #Right Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = FALSE), 4)
## 10% 5% 5/2% 1% 1/2%
## 1.3722 1.8125 2.2281 2.7638 3.1693
16.5 The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).
Why are the degrees of freedom \((n-1)\)?
Question: Is there any minimum sample size we must consider before calculating degrees of freedom?
Guess: Degrees of freedom are also used to remove possible bias.
Figure 5.1 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors
Refer Type I and Type II Errors (B12) & Type I and Type II Errors
17.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).
17.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).
17.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.
16.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.
17.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
17.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.
17.23 p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret
15.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’
Population Size = 100, \({\alpha} = 0.05\)
Hypothesis: \(\text{\{Right Tail or Upper Tail\} } {H_0} : {\mu} \leq 22 \iff {H_a}: {\mu} > 22\)
Sample: n=4, dof = 3, \({\overline{x}} = 23\)
Sample: n=10, dof = 9, \({\overline{x}} = 23\)
We know that if we take another sample, we will get a different sample mean. So we need to confirm whether the calculated sample mean \({\overline{x}} = 23\) represents the population mean \({\mu}\), i.e., can we reject or fail to reject \({H_0}\) based on this sample?
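A worked sketch for the n = 4 sample above; the sample standard deviation s = 2 is an assumed value, since it is not given:

```r
# #Right-tail t-test sketch for H0: mu <= 22 (s = 2 is assumed, not from the data)
mu0 <- 22; xbar <- 23; s <- 2; n <- 4
t_stat <- (xbar - mu0) / (s / sqrt(n))       # t = 1 with these numbers
pt(t_stat, df = n - 1, lower.tail = FALSE)   # p ~ 0.196 > 0.05: fail to reject H0
```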
3 Approaches for Hypothesis Testing -
If the population standard deviation \((\sigma)\) is known, apply the z-test. If it is unknown, apply the t-test. The t-test converges to the z-test as the sample size increases.
2-T Rule of Thumb - Skipped “09:55”
# #Get P(z)
z01 <- round(pnorm(3.44), digits = 6)
z02 <- 1 - round(pnorm(3.44), digits = 6)
z03 <- round(pnorm(3.44, lower.tail = FALSE), digits = 6)
z04 <- format(pnorm(4.55, lower.tail = FALSE), digits = 3, scientific = FALSE)
z05 <- format(pnorm(1.22, lower.tail = FALSE), digits = 5)
z06 <- format(pnorm(1.99, lower.tail = FALSE), digits = 5)
z07 <- format(pnorm(1.99, lower.tail = TRUE), digits = 5)
Example:
# #Get P(z)
z01 <- round(pnorm(3.44), digits = 6)
z02 <- 1 - round(pnorm(3.44), digits = 6)
z03 <- round(pnorm(3.44, lower.tail = FALSE), digits = 6)
z04 <- format(pnorm(4.55, lower.tail = FALSE), digits = 3, scientific = FALSE)
z05 <- format(pnorm(1.22, lower.tail = FALSE), digits = 5)
z06 <- format(pnorm(1.99, lower.tail = FALSE), digits = 5)
z07 <- format(pnorm(1.99, lower.tail = TRUE), digits = 5)
18.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
18.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
18.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
Example:
18.6 Independent sample design: A simple random sample of workers is selected and each worker in the sample uses method 1. A second independent simple random sample of workers is selected and each worker in this sample uses method 2.
18.7 Matched sample design: One simple random sample of workers is selected. Each worker first uses one method and then uses the other method. The order of the two methods is assigned randomly to the workers, with some workers performing method 1 first and others performing method 2 first. Each worker provides a pair of data values, one value for method 1 and another value for method 2.
Test Statistic for Independent Sample t-Test Statistic is given by (18.9) as shown below
\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}} = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}}}\]
The t-test is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis.
A t-test is the most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.
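The statistic in (18.9) with the unequal-variance denominator is exactly what `t.test()` computes by default (the Welch form). The toy vectors below are illustrative assumptions:

```r
# #Manual computation of the statistic in equation (18.9) vs t.test() (Welch default)
set.seed(7)
x1 <- rnorm(30, mean = 10, sd = 2)
x2 <- rnorm(25, mean = 9, sd = 2)
D0 <- 0
t_manual <- (mean(x1) - mean(x2) - D0) / sqrt(var(x1) / 30 + var(x2) / 25)
t_welch <- unname(t.test(x1, x2, mu = D0)$statistic)
all.equal(t_manual, t_welch)   # TRUE
```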
Example: Suppose we want to evaluate the effect of a Training Program.
We can take two samples of 50 people each. The first set, "Untrained", would come from the people who did not receive training. The second set, "Trained", would come from the people who have undergone the training. Comparison of these two sample mean performances is done with an "independent sample" t-test.
Or
We can take a sample of 50 "Untrained" people and get their mean performance. Provide the training to these 50 people, then get their mean performance again. Now we have "paired" samples of the performances of the same people: one set before the training and another after. Comparison of these two sample mean performances is done with a "paired sample" t-test.
Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a “repeated measures” t-test).
The matched sample design is generally preferred to the independent sample design because the matched-sample procedure often improves the precision of the estimate.
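A minimal sketch of the matched (paired) design with simulated before/after scores; all numbers are assumed:

```r
# #Paired t-test sketch: the same 50 workers measured before and after training
set.seed(11)
before <- rnorm(50, mean = 60, sd = 8)
after  <- before + rnorm(50, mean = 3, sd = 4)   # training assumed to add ~3 points
t.test(after, before, paired = TRUE, alternative = "greater")
# #Equivalent to a one-sample t-test on the differences:
t.test(after - before, mu = 0, alternative = "greater")
```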
Assume there are 3 samples A, B, C. We can do \(C_2^3 = 3\) number of tests i.e. \(\{(A, B), (B, C), (C, A)\}\). However, assuming \({\alpha} = 0.05 \iff {\gamma} = 0.95\) for each test, the confidence for 3 consecutive tests become \({\gamma}^3 = 0.857 \iff {\alpha} = 0.143\), which is a very high and unacceptable value. To avoid this, we use ANOVA as a single test.
A high value of the F statistic would indicate that the population means are different.
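The familywise numbers quoted above can be reproduced directly:

```r
# #Confidence after 3 consecutive independent tests at gamma = 0.95
choose(3, 2)    # number of pairwise tests among 3 samples: 3
0.95^3          # joint confidence 0.857375
1 - 0.95^3      # effective alpha 0.142625
```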
17.14 If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (17.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)
17.24 If \({\sigma}\) is unknown, the sampling distribution of the test statistic follows the t distribution with \((n - 1)\) degrees of freedom. Refer equation (17.3) \(\to t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}}\)
Please import the WSES data into the xxWSES object. Due to copyright, it has not been shared.
17.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
Caution: While importing data, Mac users will probably find it easier to import the CSV file. However, I am not a Mac user, so I cannot comment on this.
# #Import the Data and assign to a temporary variable for ease of use
xxWSES <- f_getRDS(xxWSES)
bb <- xxWSES
str(bb)
## spec_tbl_df [1,000 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Opportunity No. : num [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
## $ Reporting Status : chr [1:1000] "Lost" "Won" "Lost" "Won" ...
## $ Sales Outcome : num [1:1000] 0 1 0 1 1 0 0 0 0 0 ...
## $ Product : chr [1:1000] "LearnSys" "GTMSys" "GTMSys" "GTMSys" ...
## $ Industry : chr [1:1000] "Banks" "Airline" "Capital Markets" "Insurance" ...
## $ Region : chr [1:1000] "Africa" "UK" "UK" "UK" ...
## $ Relative Strength in the segment: num [1:1000] 45 56 48 58 49 64 54 54 67 54 ...
## $ Profit of Customer in Million : num [1:1000] 2.11 0.79 1.62 0.09 1.46 0.94 1.32 1.09 1.3 1.32 ...
## $ Sales Value in Million : num [1:1000] 10.29 11.42 5.63 10.17 10.6 ...
## $ Profit % : num [1:1000] 29 46 70 46 32 65 50 74 46 57 ...
## $ WSES Proportion in Joint Bid : num [1:1000] 66 50 50 56 54 52 68 48 56 63 ...
## $ Leads Conversion Class : chr [1:1000] "F" "E" "F" "E" ...
# #List Column Headers
names(bb)
## [1] "Opportunity No." "Reporting Status"
## [3] "Sales Outcome" "Product"
## [5] "Industry" "Region"
## [7] "Relative Strength in the segment" "Profit of Customer in Million"
## [9] "Sales Value in Million" "Profit %"
## [11] "WSES Proportion in Joint Bid" "Leads Conversion Class"
#
# #Rename Headers
bb_headers <- c("SN" , "RS" , "SO" , "PDT" , "INT" , "RG" , "RS1" , "PM" , "SVM" , "PP" , "JB" , "LCC")
names(bb) <- bb_headers
#
# #Verification
names(bb)
## [1] "SN" "RS" "SO" "PDT" "INT" "RG" "RS1" "PM" "SVM" "PP" "JB" "LCC"
From the case study, it can be seen that multiple columns are categorical (factor) or ordinal (ordered factor).
Question: What is the importance of having this kind of ordered factor over a simple factor?
Question: If 'RS' is already integer 0 & 1, then why convert it to a factor?
Question: Why is LCC NOT ordinal? (Aside)
# #"Reporting Status i.e. RS" Converting "character" to "factor" and Label them
bb$RS <- factor(bb$RS, levels = c("Lost", "Won"), labels = c("0", "1"))
#
# #"Sales Outcome i.e. SO" Converting "numeric" to "factor"
bb$SO <- factor(bb$SO)
#
# #"Product Vertical i.e. PDT" Ordinal
# #What are the unique values in this column
unique(bb$PDT)
## [1] "LearnSys" "GTMSys" "Lifesys" "Finsys" "Procsys" "Logissys" "ContactSys"
#
# #Converting "character" to "Ordered factor"
# #Note: If level order is not provided, by default, alphabetical ordering will be assigned.
levels(factor(bb$PDT, ordered = TRUE))
## [1] "ContactSys" "Finsys" "GTMSys" "LearnSys" "Lifesys" "Logissys" "Procsys"
#
# #Provide ordering of factor levels in Ascending Order.
bb$PDT <- factor(bb$PDT, ordered = TRUE,
levels = c("GTMSys", "Procsys", "LearnSys", "Finsys", "Lifesys", "Logissys", "ContactSys"))
#
# #"Industry i.e. INT" Ordinal
bb$INT <- factor(bb$INT, ordered = TRUE,
levels = c("Capital Markets", "Banks", "Defense", "Consumer goods", "Others", "Security",
"Energy", "Insurance", "Airline", "Finance", "Infrastructure", "Mobility", "Other Govt.",
"Govt.", "Telecom equipments", "Health", "Clinical research", "Agriculture"))
#
# #"Region i.e. RG" Ordinal
bb$RG <- factor(bb$RG, ordered = TRUE, levels = c("UK", "Other Europe", "Americas", "Africa",
"India", "Japan", "Singapore", "Spain", "Canada"))
#
# #"Leads Conversion Class i.e. LCC" is Ordinal; however, we are going with Nominal here.
bb$LCC <- factor(bb$LCC, levels = c("E", "V", "F", "L"), labels = c(1, 2, 3, 4))
str(bb)
## spec_tbl_df [1,000 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ SN : num [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
## $ RS : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
## $ SO : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
## $ PDT: Ord.factor w/ 7 levels "GTMSys"<"Procsys"<..: 3 1 1 1 1 5 4 4 1 4 ...
## $ INT: Ord.factor w/ 18 levels "Capital Markets"<..: 2 9 1 8 5 5 4 4 16 4 ...
## $ RG : Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 4 1 1 1 1 2 1 1 1 1 ...
## $ RS1: num [1:1000] 45 56 48 58 49 64 54 54 67 54 ...
## $ PM : num [1:1000] 2.11 0.79 1.62 0.09 1.46 0.94 1.32 1.09 1.3 1.32 ...
## $ SVM: num [1:1000] 10.29 11.42 5.63 10.17 10.6 ...
## $ PP : num [1:1000] 29 46 70 46 32 65 50 74 46 57 ...
## $ JB : num [1:1000] 66 50 50 56 54 52 68 48 56 63 ...
## $ LCC: Factor w/ 4 levels "1","2","3","4": 3 1 3 1 2 3 2 2 4 4 ...
# #If there are "character" columns which should be "numeric"
bb$RS1 <- as.numeric(bb$RS1)
bb$PP <- as.numeric(bb$PP)
bb$JB <- as.numeric(bb$JB)
# #Equivalent
bb <- bb %>% mutate(across(c(RS1, PP, JB), as.numeric))
Assume average sales of 8 million dollars and a population standard deviation of 2 million dollars. Check whether the average sales value in the population is at least 8 million dollars.
17.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
mean(bb$SVM)
{8.0442 - 8} / {2 / sqrt(1000)}
{mean(bb$SVM) - 8} / {2 / sqrt(1000)}
pnorm(q = 0.6988, lower.tail = FALSE)
1 - pnorm(q = 0.6988, lower.tail = TRUE)
1 - pnorm(q = 0.6988)
pnorm(q = -0.6988)
# #Note: pnorm() provides the probability to the left of the z-value
If the population standard deviation is unknown, evaluate the same hypothesis again.
{mean(bb$SVM) - 8} / {sd(bb$SVM) / sqrt(1000)}
pt(q = 0.7049, df = nrow(bb) - 1, lower.tail = FALSE)
t.test(bb$SVM, mu = 8, alternative = "greater")
Check whether the proportion of leads won by WSES is more than 50%.
17.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)
sqrt(0.50 * {1 - 0.50} / 1000)
{0.481 - 0.50} / {sqrt(0.50 * {1 - 0.50} / 1000)}
pnorm(q = -1.2016, lower.tail = FALSE)
1 - pnorm(q = -1.2016)
# #Proportions
bb %>% group_by(SO) %>% summarise(PCT = n() / nrow(.))
## # A tibble: 2 x 2
## SO PCT
## <fct> <dbl>
## 1 0 0.519
## 2 1 0.481
#
pnorm(q = -1.2016, lower.tail = FALSE)
## [1] 0.8852407
# #Grouped Percentages
# #table() gives a count
table(bb$SO)
##
## 0 1
## 519 481
#
# #prop.table() works on a table of counts, so it needs table()
prop.table(table(bb$SO))
##
## 0 1
## 0.519 0.481
#
# #Similar
bb %>% group_by(SO) %>% summarise(N = n()) %>% mutate(PCT = N / sum(N))
## # A tibble: 2 x 3
## SO N PCT
## <fct> <int> <dbl>
## 1 0 519 0.519
## 2 1 481 0.481
bb %>% group_by(SO) %>% summarise(PCT = n() / nrow(.))
## # A tibble: 2 x 2
## SO PCT
## <fct> <dbl>
## 1 0 0.519
## 2 1 0.481
Check whether the probability of winning a sales lead for the product "LearnSys" is more than that of "Finsys."
18.10 \(\text{\{Right or Upper\} } {H_0} : {p}_1 - {p}_2 \leq 0 \iff {H_a}: {p}_1 - {p}_2 > 0\)
# #Data | Subset | Filter | Update Factor levels
ii <- bb %>% select(PDT, SO, SVM) %>%
filter(PDT %in% c("LearnSys", "Finsys")) %>% mutate(across(PDT, factor))
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
## $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
## $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
## $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
#
# #Count
table(ii$PDT, ii$SO)
##
## 0 1
## LearnSys 55 71
## Finsys 83 34
#
# #Proportion Table
round(prop.table(table(ii$PDT, ii$SO), margin = 1), 3)
##
## 0 1
## LearnSys 0.437 0.563
## Finsys 0.709 0.291
#
prop.test(x = c(71, 34), n = c(126, 117), alternative = "greater")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(71, 34) out of c(126, 117)
## X-squared = 17.316, df = 1, p-value = 1.583e-05
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.1644089 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.5634921 0.2905983
# #Subset the dataset
ii <- data.frame(PDT = bb$PDT, SO = bb$SO, SVM = bb$SVM)
# #Subset only for required type of rows #243 Observations
jj <- subset(ii, PDT == "LearnSys" | PDT == "Finsys")
#
# #subset() is discouraged. Alternatives are given below
kk <- ii[ii$PDT %in% c("LearnSys", "Finsys"), ]
ll <- ii[which(ii$PDT %in% c("LearnSys", "Finsys")), ]
#
mm <- bb %>% select(PDT, SO, SVM) %>% filter(PDT == "LearnSys" | PDT == "Finsys")
nn <- bb %>% select(PDT, SO, SVM) %>% filter(PDT %in% c("LearnSys", "Finsys"))
#
stopifnot(all(identical(jj, kk), identical(jj, ll),
identical(as_tibble(jj), mm), identical(mm, nn)))
str(nn)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
## $ PDT: Ord.factor w/ 7 levels "GTMSys"<"Procsys"<..: 3 4 4 4 3 4 4 3 4 4 ...
## $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
## $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
ii <- bb %>% select(PDT, SO, SVM) %>%
filter(PDT %in% c("LearnSys", "Finsys")) %>% mutate(across(PDT, factor))
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
## $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
## $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
## $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
#
# #Proportion Table: 'margin' gives the margin to split by, i.e.
# #1 means row-wise proportions, 2 means column-wise proportions
round(prop.table(table(ii$PDT, ii$SO), margin = 1), 3)
##
## 0 1
## LearnSys 0.437 0.563
## Finsys 0.709 0.291
round(prop.table(table(ii$PDT, ii$SO), margin = 2), 3)
##
## 0 1
## LearnSys 0.399 0.676
## Finsys 0.601 0.324
#
# #Similar
ii %>% select(PDT, SO) %>%
count(PDT, SO) %>% pivot_wider(names_from = SO, values_from = n) %>%
mutate(SUM = rowSums(across(where(is.numeric)))) %>%
mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1)))
## # A tibble: 2 x 4
## PDT `0` `1` SUM
## <ord> <dbl> <dbl> <dbl>
## 1 LearnSys 43.7 56.3 100
## 2 Finsys 70.9 29.1 100
Check whether the average sales value of "LearnSys" projects is higher than that of "Finsys" projects. (\({\alpha} = 0.05\))
18.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
“ForLater”
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
## $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
## $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
## $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
ii_var <- var.test(SVM ~ PDT, data = ii)
if(ii_var$p.value > 0.05) {
cat(paste0("Variances are same. Pooled Test can be applied. \n"))
t.test(formula = SVM ~ PDT, data = ii, alternative = "greater", var.equal = TRUE)
} else {
cat(paste0("Variances are NOT same. Welch Test should be applied. \n"))
t.test(formula = SVM ~ PDT, data = ii, alternative = "greater", var.equal = FALSE)
}
## Variances are same. Pooled Test can be applied.
##
## Two Sample t-test
##
## data: SVM by PDT
## t = 0.93503, df = 241, p-value = 0.1754
## alternative hypothesis: true difference in means between group LearnSys and group Finsys is greater than 0
## 95 percent confidence interval:
## -0.1728648 Inf
## sample estimates:
## mean in group LearnSys mean in group Finsys
## 8.030476 7.804786
jj <- ii %>% filter(SO == "1")
jj_var <- var.test(SVM ~ PDT, data = jj)
if(jj_var$p.value > 0.05) {
t.test(formula = SVM ~ PDT, data = jj, alternative = "greater", var.equal = TRUE)
} else {
cat(paste0("Problem: Difference of means can be tested only if the variances are the same.\n"))
}
##
## Two Sample t-test
##
## data: SVM by PDT
## t = 0.62152, df = 103, p-value = 0.2678
## alternative hypothesis: true difference in means between group LearnSys and group Finsys is greater than 0
## 95 percent confidence interval:
## -0.3949429 Inf
## sample estimates:
## mean in group LearnSys mean in group Finsys
## 7.947887 7.711471
Check whether there is a difference in the average profit across the geographical locations: United Kingdom, India and the Americas.
21.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)
pf(q = 1.5, df1 = 2, df2 = 689, lower.tail = FALSE)
ii <- bb %>% filter(RG %in% c("UK", "India", "Americas")) %>% select(RG, PP, SO)
str(ii)
## tibble [692 x 3] (S3: tbl_df/tbl/data.frame)
## $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ PP: num [1:692] 46 70 46 32 50 74 46 57 60 58 ...
## $ SO: Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 1 1 1 ...
#
# #ANOVA
ii_aov <- aov(formula = PP ~ RG, data = ii)
#
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##
## 50.7052
##
## RG
## UK Americas India
## 50.6 50.33 53.51
## rep 553.0 104.00 35.00
#
# #Summary
summary(ii_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## RG 2 297 148.7 1.505 0.223
## Residuals 689 68075 98.8
jj <- ii %>% filter(SO == "1")
str(jj)
## tibble [339 x 3] (S3: tbl_df/tbl/data.frame)
## $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 1 1 1 3 1 1 1 1 1 1 ...
## $ PP: num [1:339] 46 46 32 42 40 49 42 50 38 56 ...
## $ SO: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#
# #ANOVA
jj_aov <- aov(formula = PP ~ RG, data = jj)
#
# #
model.tables(jj_aov, type = "means")
## Tables of means
## Grand mean
##
## 46.37168
##
## RG
## UK Americas India
## 46.23 46.36 48.65
## rep 267.00 55.00 17.00
#
# #Summary
summary(jj_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## RG 2 93 46.75 0.532 0.588
## Residuals 336 29508 87.82
Check whether the sales conversions are different for different geographical locations.
ii <- bb %>% select(RG, SO)
str(ii)
## tibble [1,000 x 2] (S3: tbl_df/tbl/data.frame)
## $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 4 1 1 1 1 2 1 1 1 1 ...
## $ SO: Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
#
round(prop.table(table(ii$RG, ii$SO), margin = 1), 3)
##
## 0 1
## UK 0.517 0.483
## Other Europe 0.582 0.418
## Americas 0.471 0.529
## Africa 0.409 0.591
## India 0.514 0.486
## Japan 0.375 0.625
## Singapore 0.739 0.261
## Spain 0.917 0.083
## Canada 0.333 0.667
#
# #Chi-Sq Test
tryCatch(chisq.test(table(ii$RG, ii$SO)),
warning = function(w) {
print(paste0(w))
suppressWarnings(chisq.test(table(ii$RG, ii$SO)))
})
## [1] "simpleWarning in chisq.test(table(ii$RG, ii$SO)): Chi-squared approximation may be incorrect\n"
##
## Pearson's Chi-squared test
##
## data: table(ii$RG, ii$SO)
## X-squared = 22.263, df = 8, p-value = 0.004452
Check whether the sales conversions depend on the sales value. Check this claim by making 3 groups of Sales Value: < 6 million, [6, 8] million, and > 8 million dollars.
Both variables are categorical, so a Chi-squared test is required.
\(P_{\chi^2} > {\alpha} \to {H_0}\) cannot be rejected.
ii <- bb %>% select(SO, SVM)
#
# #Create 3 Groups with middle group inclusive of both 6 & 8
ii$RSVM <- cut(ii$SVM, breaks = c(0, 5.9999, 8, 15), labels = 1:3)
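cut() with these breaks produces right-closed intervals (0, 5.9999], (5.9999, 8], (8, 15], so both 6 and 8 land in the middle group. The same binning rule can be sketched language-agnostically with Python's bisect (illustrative only):

```python
from bisect import bisect_left

# Inner break points of the right-closed intervals (0, 5.9999], (5.9999, 8], (8, 15]
inner_breaks = [5.9999, 8]

def bin_label(x):
    # bisect_left returns the count of breaks strictly below x,
    # which is exactly the 0-based index of a right-closed bin
    return 1 + bisect_left(inner_breaks, x)

labels = [bin_label(x) for x in (5.9999, 6.0, 8.0, 8.01)]
print(labels)  # [1, 2, 2, 3]
```

Note 5.9999 itself falls in group 1, matching cut()'s default `right = TRUE` behaviour.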
#
summary(ii)
## SO SVM RSVM
## 0:519 Min. : 1.640 1:155
## 1:481 1st Qu.: 6.690 2:346
## Median : 8.000 3:499
## Mean : 8.044
## 3rd Qu.: 9.440
## Max. :14.230
#
table(ii$RSVM, ii$SO)
##
## 0 1
## 1 86 69
## 2 182 164
## 3 251 248
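As a cross-check of the chisq.test() call that follows, the Pearson statistic \(\sum_{ij} (O_{ij} - E_{ij})^2 / E_{ij}\) with \(E_{ij} = \text{row}_i \cdot \text{col}_j / n\) can be reproduced from this table by hand (sketch in Python):

```python
# Observed 3x2 contingency table from the output above
obs = [[86, 69], [182, 164], [251, 248]]

row_tot = [sum(r) for r in obs]           # [155, 346, 499]
col_tot = [sum(c) for c in zip(*obs)]     # [519, 481]
n = sum(row_tot)                          # 1000

# Pearson chi-square: sum over cells of (O - E)^2 / E
x2 = sum((obs[i][j] - row_tot[i] * col_tot[j] / n) ** 2
         / (row_tot[i] * col_tot[j] / n)
         for i in range(3) for j in range(2))
print(round(x2, 3))  # 1.377
```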
#
# #Chi-Sq Test
chisq.test(table(ii$RSVM, ii$SO))
##
## Pearson's Chi-squared test
##
## data: table(ii$RSVM, ii$SO)
## X-squared = 1.377, df = 2, p-value = 0.5023
Please import the Jayalaxmi data.
Figure 7.1 JAT: Usage and Users over the TimePeriod
bb <- xxJdata %>% rename(Dates = "Month-Year", Users = "No of users") %>%
mutate(across(Dates, as_date)) %>% select(Dates, Week, Usage, Users)
#
ii <- bb %>%
mutate(NewDate = case_when(Week == "Week1" ~ Dates, Week == "Week2" ~ Dates + 7,
Week == "Week3" ~ Dates + 14, Week == "Week4" ~ Dates + 21)) %>%
mutate(DIFF = c(1, diff(NewDate)),
TF = ifelse(DIFF > 10, FALSE, TRUE),
grp = cumsum(DIFF > 10) + 1)
#
str(ii)
## tibble [123 x 8] (S3: tbl_df/tbl/data.frame)
## $ Dates : Date[1:123], format: "2015-06-01" "2015-07-01" "2015-07-01" ...
## $ Week : chr [1:123] "Week4" "Week1" "Week2" "Week3" ...
## $ Usage : num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
## $ Users : num [1:123] 2 1 1 4 6 12 13 10 7 12 ...
## $ NewDate: Date[1:123], format: "2015-06-22" "2015-07-01" "2015-07-08" ...
## $ DIFF : num [1:123] 1 9 7 7 7 10 7 7 7 10 ...
## $ TF : logi [1:123] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ grp : num [1:123] 1 1 1 1 1 1 1 1 1 1 ...
# #Plot
hh <- ii
title_hh <- "Usage Pattern"
caption_hh <- "B15P01"
#
B15 <- hh %>% {
ggplot(data = ., aes(x = NewDate, y = Usage)) +
geom_point() +
#geom_smooth(aes(group = grp)) +
geom_line(aes(group = grp)) +
scale_x_date(date_breaks = "3 months",
labels = function(x) if_else(is.na(lag(x)) | !year(lag(x)) == year(x),
paste(month(x, label = TRUE), "\n", year(x)),
paste(month(x, label = TRUE)))) +
labs(x = "Time", y = "Usage",
caption = caption_hh, title = title_hh)
}
assign(caption_hh, B15)
rm(B15)
bb <- xxJdata %>% rename(Dates = "Month-Year", Users = "No of users") %>%
mutate(across(Dates, as_date)) %>% select(Dates, Week, Usage, Users)
#
ii <- bb %>%
mutate(NewDate = case_when(Week == "Week1" ~ Dates, Week == "Week2" ~ Dates + 7,
Week == "Week3" ~ Dates + 14, Week == "Week4" ~ Dates + 21)) %>%
mutate(DIFF = c(1, diff(NewDate)),
TF = ifelse(DIFF > 10, FALSE, TRUE),
grp = cumsum(DIFF > 10) + 1) %>%
mutate(grp_step = rle(TF)$lengths %>% {rep(seq(length(.)), .)})
#
jj <- c(1, which(ii$DIFF > 10), nrow(ii) + 1)
kk <- rep(1:length(diff(jj)), diff(jj))
stopifnot(identical(kk, as.integer(ii$grp)))
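The grp construction above (cumulative count of gaps larger than 10, plus one) generalises to any gap-based grouping of a sequence; a minimal sketch with toy differences:

```python
from itertools import accumulate

def gap_groups(diffs, threshold=10):
    """Mirror of grp = cumsum(DIFF > threshold) + 1:
    a new group id starts at each element whose gap exceeds the threshold."""
    return [1 + s for s in accumulate(int(d > threshold) for d in diffs)]

# Toy gap sequence: one large jump (123) splits the series into two groups
print(gap_groups([1, 7, 7, 123, 9, 7]))  # [1, 1, 1, 2, 2, 2]
```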
#
# #Where are the missing values
ii %>% filter( !TF | lead(!TF) | lag(!TF) )
## # A tibble: 3 x 9
## Dates Week Usage Users NewDate DIFF TF grp grp_step
## <date> <chr> <dbl> <dbl> <date> <dbl> <lgl> <dbl> <int>
## 1 2017-05-01 Week4 80 27 2017-05-22 7 TRUE 1 1
## 2 2017-09-01 Week4 1923 459 2017-09-22 123 FALSE 2 2
## 3 2017-10-01 Week1 1575 360 2017-10-01 9 TRUE 2 3
Test the claim that disease 6 (leaf curl) information was accessed at least 60 times every month on average since October 2017 due to this disease outbreak. \(({\alpha} = 0.05)\)
17.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
# #Data | Rename | Sum Months D6 | Change from Date Time to Date
bb <- xxJdata %>%
rename(Dates = "Month-Year") %>%
group_by(Dates) %>%
summarise(D6 = sum(D6)) %>%
mutate(across(Dates, as_date))
#
# #There are missing months, but those months are not applicable in this question
ii <- bb %>% filter(Dates >= "2017-10-01")
#
t_ii <- {mean(ii$D6) - 60} / {sd(ii$D6) / sqrt(nrow(ii))}
print(t_ii)
## [1] 5.377075
pt(t_ii, df = nrow(ii) - 1, lower.tail = FALSE)
## [1] 0.000516808
#
# #One Sample t-Test
# #Note: conf.level should arguably be 0.95; it affects only the reported confidence bound, not the p-value
t.test(ii$D6, mu = 60, alternative = "greater", conf.level = 0.05)
##
## One Sample t-test
##
## data: ii$D6
## t = 5.3771, df = 7, p-value = 0.0005168
## alternative hypothesis: true mean is greater than 60
## 5 percent confidence interval:
## 314.2406 Inf
## sample estimates:
## mean of x
## 248
str(bb)
## tibble [33 x 2] (S3: tbl_df/tbl/data.frame)
## $ Dates: Date[1:33], format: "2015-06-01" "2015-07-01" "2015-08-01" ...
## $ D6 : num [1:33] 0 6 31 41 74 104 91 88 74 88 ...
summary(bb)
## Dates D6
## Min. :2015-06-01 Min. : 0.0
## 1st Qu.:2016-02-01 1st Qu.: 71.0
## Median :2016-10-01 Median : 91.0
## Mean :2016-10-25 Mean :130.2
## 3rd Qu.:2017-09-01 3rd Qu.:198.0
## Max. :2018-05-01 Max. :365.0
#
# #Assuming each row is one month with no duplicates
stopifnot(identical(anyDuplicated(bb$Dates), 0L))
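The month-completion step that follows (seq() plus setdiff()) amounts to a set difference against a complete monthly sequence; a hedged sketch of the same idea with hypothetical dates:

```python
from datetime import date

def month_seq(start, end):
    """All first-of-month dates from start to end inclusive (day-1 dates assumed)."""
    out, y, m = [], start.year, start.month
    while (y, m) <= (end.year, end.month):
        out.append(date(y, m, 1))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return out

# Hypothetical observed months with a three-month gap
observed = [date(2017, 4, 1), date(2017, 5, 1), date(2017, 9, 1)]
full = month_seq(min(observed), max(observed))
missing = sorted(set(full) - set(observed))
print([d.isoformat() for d in missing])  # ['2017-06-01', '2017-07-01', '2017-08-01']
```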
#
# #Create Sequence of Months
#ii <- seq(ymd("2015-6-1"), ymd("2018-5-1"), by = "months")
ii <- tibble(Dates = seq(min(bb$Dates), max(bb$Dates), by = "months"))
#
diff_len <- nrow(ii) - nrow(bb)
#
if(!identical(diff_len, 0L)) {
cat(paste0("Number of missing months = ", diff_len,"\n"))
#
# #Find Values that should be in Complete Sequence but are missing in the data
as_date(setdiff(ii$Dates, bb$Dates))
# #OR
ii %>% anti_join(bb)
#
# #This does not need a separate Vector of all Months
# #Get Months Difference using Integer Division and
# #Filter Rows which are not consecutive and rows above them
bb %>%
mutate(diff_months = (interval(lag(Dates), Dates)) %/% months(1)) %>%
filter( (diff_months != 1) | lead(diff_months != 1) )
}
## Number of missing months = 3
## # A tibble: 2 x 3
## Dates D6 diff_months
## <date> <dbl> <dbl>
## 1 2017-05-01 51 1
## 2 2017-09-01 111 4
# #Fill Missing Months
jj <- as_tibble(merge(bb, ii, by = "Dates", all = TRUE))
kk <- right_join(bb, ii, by = "Dates") %>% arrange(Dates)
stopifnot(identical(jj, kk))
# #Replace NA
ll <- kk %>% mutate(across(D6, coalesce, 0))
Test the claim that, among the app users for disease information, at least 15% access information related to disease 6. \(({\alpha} = 0.05)\)
17.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
Determine whether the proportion of farmers searching for D6 is more than \(p_0 = 0.15\)
17.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)
# #Data | Sum Disease, Variety, Micronutrients
aa <- xxJdata %>%
mutate(sumD = rowSums(across(starts_with("D"))),
sumV = rowSums(across(starts_with("V")))) %>%
rename(Dates = "Month-Year", Users = "No of users", Micro = "Micronutrient") %>%
mutate(SUM = rowSums(across(c(sumD, sumV, Micro))),
DIFF = Usage - SUM) %>%
select(Dates, Users, Usage, SUM, DIFF, sumD, sumV, Micro, D6) %>%
mutate(across(Dates, as_date)) %>%
mutate(Fraction = D6/sumD)
#
# #Confirmed that Usage is Sum Total of Disease, Variety, Micronutrients
unique(aa$DIFF)
## [1] 0
#
# #Working Set | Exclude 1 NA i.e. where sumD is zero
bb <- aa %>% drop_na(Fraction) %>% select(Usage, sumD, D6, Fraction)
#
# #Check n (Sample Count) and x (Count of Success)
bb %>% summarise(across(c(sumD, D6), sum))
## # A tibble: 1 x 2
## sumD D6
## <dbl> <dbl>
## 1 26830 4295
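prop.test() with continuity correction reports \(X^2 = \frac{(|x - np_0| - 0.5)^2}{np_0(1-p_0)}\); with the counts above (x = 4295, n = 26830, \(p_0 = 0.15\)), this can be verified by hand:

```python
def prop_chisq_cc(x, n, p0):
    """Chi-squared statistic of the one-sample proportion test
    with Yates continuity correction."""
    return (abs(x - n * p0) - 0.5) ** 2 / (n * p0 * (1 - p0))

x2 = prop_chisq_cc(4295, 26830, 0.15)
print(round(x2, 3))  # 21.311
```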
#
# #One Sample Proportion Test with continuity correction
bb_prop <- prop.test(x = sum(bb$D6), n = sum(bb$sumD), p = 0.15,
alternative = "greater", conf.level = 0.95)
bb_prop
##
## 1-sample proportions test with continuity correction
##
## data: sum(bb$D6) out of sum(bb$sumD), null probability 0.15
## X-squared = 21.311, df = 1, p-value = 1.953e-06
## alternative hypothesis: true p is greater than 0.15
## 95 percent confidence interval:
## 0.1564156 1.0000000
## sample estimates:
## p
## 0.160082
# #Impact if we try to evaluate the proportion of D6 searches out of ALL Usage (Disease, Variety, Micro)
# #Check n (Sample Count) and x (Count of Success)
bb %>% summarise(across(c(Usage, D6), sum))
## # A tibble: 1 x 2
## Usage D6
## <dbl> <dbl>
## 1 71646 4295
#
# #One Sample Proportion Test with continuity correction
# #With p-value = 1, we cannot claim that 15% searches are for D6 only out of ALL Usage
prop.test(x = sum(bb$D6), n = sum(bb$Usage), p = 0.15,
alternative = "greater", conf.level = 0.95)
##
## 1-sample proportions test with continuity correction
##
## data: sum(bb$D6) out of sum(bb$Usage), null probability 0.15
## X-squared = 4556.2, df = 1, p-value = 1
## alternative hypothesis: true p is greater than 0.15
## 95 percent confidence interval:
## 0.05849838 1.00000000
## sample estimates:
## p
## 0.05994752
# #One Sample t-Test
# #Note: conf.level should arguably be 0.95; it affects only the reported confidence bound, not the p-value
t.test(bb$Fraction, mu = 0.15, alternative = "greater", conf.level = 0.05)
##
## One Sample t-test
##
## data: bb$Fraction
## t = 1.324, df = 121, p-value = 0.09401
## alternative hypothesis: true mean is greater than 0.15
## 5 percent confidence interval:
## 0.163334 Inf
## sample estimates:
## mean of x
## 0.155921
Test the claim that the average number of users in 2017-2018 is more than the average number of users in 2015-2016. \(({\alpha} = 0.05)\)
18.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
# #Data
bb <- xxJdata %>%
rename(Dates = "Month-Year", Users = "No of users") %>%
mutate(across(Dates, as_date)) %>%
select(Dates, Users, Usage)
#
# #Missing Months
ii <- tibble(Dates = seq(min(bb$Dates), max(bb$Dates), by = "months"))
jj <- right_join(bb, ii, by = "Dates") %>% arrange(Dates) %>% mutate(across(Users, coalesce, 0))
#
# #Create 2 Groups
jj$Year <- cut(jj$Dates, breaks = c(min(ii$Dates), as_date("2017-01-01"), Inf),
labels = c("2015-2016", "2017-2018"))
#
# #Verify Changes
jj[!duplicated(jj$Year), ]
## # A tibble: 2 x 4
## Dates Users Usage Year
## <date> <dbl> <dbl> <fct>
## 1 2015-06-01 2 4 2015-2016
## 2 2017-01-01 92 495 2017-2018
jj %>% filter(Dates %in% ymd(c("2016-12-01", "2017-01-01")))
## # A tibble: 8 x 4
## Dates Users Usage Year
## <date> <dbl> <dbl> <fct>
## 1 2016-12-01 50 536 2015-2016
## 2 2016-12-01 54 318 2015-2016
## 3 2016-12-01 99 558 2015-2016
## 4 2016-12-01 104 573 2015-2016
## 5 2017-01-01 92 495 2017-2018
## 6 2017-01-01 130 578 2017-2018
## 7 2017-01-01 87 436 2017-2018
## 8 2017-01-01 60 261 2017-2018
#
# #For Two Sample t-test, check if Variances are equal
jj_var <- var.test(Users ~ Year, data = jj)
jj_var
##
## F test to compare two variances
##
## data: Users by Year
## F = 0.10358, num df = 72, denom df = 52, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.06158444 0.17055359
## sample estimates:
## ratio of variances
## 0.1035754
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- ifelse(jj_var$p.value > 0.05, TRUE, FALSE)
#
# #Because 1 is "2015-2016", 2 is "2017-2018", we need to perform Lower Tail Test
jj_t <- t.test(formula = Users ~ Year, data = jj, alternative = "less", var.equal = isVarEqual)
jj_t
##
## Welch Two Sample t-test
##
## data: Users by Year
## t = -7.255, df = 59.87, p-value = 4.641e-10
## alternative hypothesis: true difference in means between group 2015-2016 and group 2017-2018 is less than 0
## 95 percent confidence interval:
## -Inf -92.96694
## sample estimates:
## mean in group 2015-2016 mean in group 2017-2018
## 50.06849 170.84906
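The fractional df reported here (59.87) comes from the Welch–Satterthwaite approximation \(\nu = \frac{(a+b)^2}{a^2/(n_1-1) + b^2/(n_2-1)}\) with \(a = s_1^2/n_1\), \(b = s_2^2/n_2\). A standalone sketch of that formula (toy inputs, not the Users data):

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite degrees of freedom for the unequal-variance t-test."""
    a, b = s1**2 / n1, s2**2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

# Hypothetical group SDs and sizes, for illustration only
df = welch_df(30.0, 73, 95.0, 53)
```

When the two variances and sample sizes are equal, the formula collapses to the pooled df \(n_1 + n_2 - 2\).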
#
# #Alternatively, we can reverse Factor levels to perform Upper Tail Test
kk <- jj
kk$Year <- factor(kk$Year, levels = rev(levels(jj$Year)))
#
t.test(formula = Users ~ Year, data = kk, alternative = "greater", var.equal = isVarEqual)
##
## Welch Two Sample t-test
##
## data: Users by Year
## t = 7.255, df = 59.87, p-value = 4.641e-10
## alternative hypothesis: true difference in means between group 2017-2018 and group 2015-2016 is greater than 0
## 95 percent confidence interval:
## 92.96694 Inf
## sample estimates:
## mean in group 2017-2018 mean in group 2015-2016
## 170.84906 50.06849
# #Data
str(jj)
## tibble [126 x 4] (S3: tbl_df/tbl/data.frame)
## $ Dates: Date[1:126], format: "2015-06-01" "2015-07-01" "2015-07-01" ...
## $ Users: num [1:126] 2 1 1 4 6 12 13 10 7 12 ...
## $ Usage: num [1:126] 4 1 25 70 100 291 225 141 148 215 ...
## $ Year : Factor w/ 2 levels "2015-2016","2017-2018": 1 1 1 1 1 1 1 1 1 1 ...
# #For Two Sample t-test, check if Variances are equal
jj_var <- var.test(Usage ~ Year, data = jj)
jj_var
##
## F test to compare two variances
##
## data: Usage by Year
## F = 0.8706, num df = 72, denom df = 49, p-value = 0.5853
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5117114 1.4434337
## sample estimates:
## ratio of variances
## 0.8705962
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- ifelse(jj_var$p.value > 0.05, TRUE, FALSE)
#
# #Because 1 is "2015-2016", 2 is "2017-2018", we need to perform Lower Tail Test
t.test(formula = Usage ~ Year, data = jj, alternative = "less", var.equal = isVarEqual)
##
## Two Sample t-test
##
## data: Usage by Year
## t = -4.7369, df = 121, p-value = 2.981e-06
## alternative hypothesis: true difference in means between group 2015-2016 and group 2017-2018 is less than 0
## 95 percent confidence interval:
## -Inf -252.37
## sample estimates:
## mean in group 2015-2016 mean in group 2017-2018
## 424.6849 812.9000
Check whether app usage is the same or different across the four weeks of a month. Test the claim that app usage picked up after January 2016. (Answered with Q3)
21.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)
Figure 7.2 JAT: QQ Plot of Usage and Sqrt(Usage)
# #Data | Missing Months can be ignored because those are missing across all weeks
bb <- xxJdata %>%
rename(Dates = "Month-Year", Users = "No of users") %>%
select(Week, Usage)
#
str(bb)
## tibble [123 x 2] (S3: tbl_df/tbl/data.frame)
## $ Week : chr [1:123] "Week4" "Week1" "Week2" "Week3" ...
## $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
summary(bb)
## Week Usage
## Length:123 Min. : 1.0
## Class :character 1st Qu.: 286.5
## Mode :character Median : 450.0
## Mean : 582.5
## 3rd Qu.: 749.5
## Max. :3462.0
#
# #ANOVA (on original data : neither normal, nor of equal variance)
bb_aov <- aov(formula = Usage ~ Week, data = bb)
#
# #
model.tables(bb_aov, type = "means")
## Tables of means
## Grand mean
##
## 582.4959
##
## Week
## Week1 Week2 Week3 Week4
## 551.9 522.2 480 764.7
## rep 31.0 30.0 30 32.0
#
# #Summary
summary(bb_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Week 3 1515178 505059 2.22 0.0894 .
## Residuals 119 27074553 227517
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
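The F value is simply the ratio of the two mean squares; it can be reconstructed from the sums of squares in the table above:

```python
# Values taken from the ANOVA summary above
ss_week, df_week = 1515178, 3     # between-groups (Week)
ss_res, df_res = 27074553, 119    # within-groups (Residuals)

# F = MS_between / MS_within
f_value = (ss_week / df_week) / (ss_res / df_res)
print(round(f_value, 2))  # 2.22
```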
#
bb_aov
## Call:
## aov(formula = Usage ~ Week, data = bb)
##
## Terms:
## Week Residuals
## Sum of Squares 1515178 27074553
## Deg. of Freedom 3 119
##
## Residual standard error: 476.9877
## Estimated effects may be unbalanced
#
kruskal.test(Usage ~ Week, data = bb)
##
## Kruskal-Wallis rank sum test
##
## data: Usage by Week
## Kruskal-Wallis chi-squared = 3.1614, df = 3, p-value = 0.3674
#
# #Poisson Test (ForLater)
#anova(glm(Usage ~ Week, data = ii, family = poisson), test = "LRT")
#
# #Transformation: Square Root Data
ii <- bb %>% mutate(Week = factor(Week)) %>% mutate(Sqrt = sqrt(Usage))
str(ii)
## tibble [123 x 3] (S3: tbl_df/tbl/data.frame)
## $ Week : Factor w/ 4 levels "Week1","Week2",..: 4 1 2 3 4 1 2 3 4 1 ...
## $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
## $ Sqrt : num [1:123] 2 1 5 8.37 10 ...
summary(ii)
## Week Usage Sqrt
## Week1:31 Min. : 1.0 Min. : 1.00
## Week2:30 1st Qu.: 286.5 1st Qu.:16.93
## Week3:30 Median : 450.0 Median :21.21
## Week4:32 Mean : 582.5 Mean :22.35
## 3rd Qu.: 749.5 3rd Qu.:27.38
## Max. :3462.0 Max. :58.84
#
# #ANOVA
ii_aov <- aov(formula = Sqrt ~ Week, data = ii)
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##
## 22.35259
##
## Week
## Week1 Week2 Week3 Week4
## 21.98 21.57 20.74 24.95
## rep 31.00 30.00 30.00 32.00
#
# #Summary
summary(ii_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Week 3 317 105.75 1.274 0.286
## Residuals 119 9874 82.98
#
kruskal.test(Sqrt ~ Week, data = ii)
##
## Kruskal-Wallis rank sum test
##
## data: Sqrt by Week
## Kruskal-Wallis chi-squared = 3.1614, df = 3, p-value = 0.3674
Statistical tests for comparing the variances of two or more samples: equality of variances across samples is called homogeneity of variances.
# #Data
str(bb)
## tibble [123 x 2] (S3: tbl_df/tbl/data.frame)
## $ Week : chr [1:123] "Week4" "Week1" "Week2" "Week3" ...
## $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
summary(bb)
## Week Usage
## Length:123 Min. : 1.0
## Class :character 1st Qu.: 286.5
## Mode :character Median : 450.0
## Mean : 582.5
## 3rd Qu.: 749.5
## Max. :3462.0
#
# #Bartlett Test
bartlett.test(Usage ~ Week, data = bb)
##
## Bartlett test of homogeneity of variances
##
## data: Usage by Week
## Bartlett's K-squared = 25.89, df = 3, p-value = 1.006e-05
#
# #Levene Test
ii <- bb %>% mutate(Week = factor(Week))
leveneTest(Usage ~ Week, data = ii)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 2.8691 0.0394 *
## 119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# #Fligner-Killeen test
fligner.test(Usage ~ Week, data = bb)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Usage by Week
## Fligner-Killeen:med chi-squared = 8.5854, df = 3, p-value = 0.03534
#
# #Transformation: Square Root Data
ii <- bb %>% mutate(Week = factor(Week)) %>% mutate(Sqrt = sqrt(Usage))
bartlett.test(Sqrt ~ Week, data = ii)
##
## Bartlett test of homogeneity of variances
##
## data: Sqrt by Week
## Bartlett's K-squared = 10.656, df = 3, p-value = 0.01374
leveneTest(Sqrt ~ Week, data = ii)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 2.4529 0.06668 .
## 119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fligner.test(Sqrt ~ Week, data = ii)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Sqrt by Week
## Fligner-Killeen:med chi-squared = 6.6688, df = 3, p-value = 0.08324
# #Do the data from each of the 4 groups follow a normal distribution
# #Shapiro-Wilk normality test
bb %>% mutate(Week = factor(Week)) %>%
group_by(Week) %>%
summarise(N = n(), Mean = mean(Usage), SD = sd(Usage),
p_Shapiro = shapiro.test(Usage)$p.value,
isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 6
## Week N Mean SD p_Shapiro isNormal
## <fct> <int> <dbl> <dbl> <dbl> <lgl>
## 1 Week1 31 552. 384. 0.00631 FALSE
## 2 Week2 30 522. 353. 0.0173 FALSE
## 3 Week3 30 480. 333. 0.000843 FALSE
## 4 Week4 32 765. 714. 0.0000964 FALSE
# #Check Q-Q plot
#qqnorm(bb[bb$Week == "Week1", ]$Usage)
#
# #Transformation: Log (Did not pass Normality)
bb %>% mutate(Week = factor(Week)) %>%
mutate(Log = log(Usage)) %>%
group_by(Week) %>%
summarise(p_Shapiro = shapiro.test(Log)$p.value,
isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
## Week p_Shapiro isNormal
## <fct> <dbl> <lgl>
## 1 Week1 0.000000533 FALSE
## 2 Week2 0.0314 FALSE
## 3 Week3 0.689 TRUE
## 4 Week4 0.00125 FALSE
#
# #Transformation: Square Root (Success: Passed Normality) - Selected
bb %>% mutate(Week = factor(Week)) %>%
mutate(Sqrt = sqrt(Usage)) %>%
group_by(Week) %>%
summarise(p_Shapiro = shapiro.test(Sqrt)$p.value,
isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
## Week p_Shapiro isNormal
## <fct> <dbl> <lgl>
## 1 Week1 0.493 TRUE
## 2 Week2 0.967 TRUE
## 3 Week3 0.108 TRUE
## 4 Week4 0.732 TRUE
#
# #Transformation: Cube Root (Success: Passed Normality) Just to check
bb %>% mutate(Week = factor(Week)) %>%
mutate(CubeRoot = Usage^(1/3)) %>%
group_by(Week) %>%
summarise(p_Shapiro = shapiro.test(CubeRoot)$p.value,
isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
## Week p_Shapiro isNormal
## <fct> <dbl> <lgl>
## 1 Week1 0.0979 TRUE
## 2 Week2 0.980 TRUE
## 3 Week3 0.359 TRUE
## 4 Week4 0.997 TRUE
#
# #Testing Residuals i.e. Data - Group Mean (Did not pass Normality)
bb %>% mutate(Week = factor(Week)) %>%
group_by(Week) %>%
mutate(Residuals = Usage - mean(Usage)) %>%
summarise(p_Shapiro = shapiro.test(Residuals)$p.value,
isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
## Week p_Shapiro isNormal
## <fct> <dbl> <lgl>
## 1 Week1 0.00631 FALSE
## 2 Week2 0.0173 FALSE
## 3 Week3 0.000843 FALSE
## 4 Week4 0.0000964 FALSE
bb <- xxJdata %>%
rename(Dates = "Month-Year", Users = "No of users") %>%
mutate(Week = factor(Week)) %>%
mutate(Sqrt = sqrt(Usage)) %>%
select(Week, Usage, Sqrt)
#
hh <- bb
title_hh <- "QQ Plot of Usage"
caption_hh <- "B15P04"
#
B15 <- hh %>% { ggplot(., aes(sample = Usage, colour = Week)) +
stat_qq() +
stat_qq_line() +
labs(caption = caption_hh, title = title_hh)
}
assign(caption_hh, B15)
rm(B15)
#
title_hh <- "QQ Plot of Sqrt(Usage)"
caption_hh <- "B15P05"
B15 <- hh %>% { ggplot(., aes(sample = Sqrt, colour = Week)) +
stat_qq() +
stat_qq_line() +
labs(caption = caption_hh, title = title_hh)
}
assign(caption_hh, B15)
rm(B15)
18.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
A new version of the app was released in August 2016. In which month within the given time frame after the launch of the new version would the mean usage pattern start to show a statistically significant shift?
# #Data
bb <- xxJdata %>% rename(Dates = "Month-Year") %>% mutate(across(Dates, as_date)) %>%
select(Dates, Week, Usage)
#
# #Create 2 Groups
bb$Year <- cut(bb$Dates, breaks = c(min(bb$Dates), as_date("2016-08-01"), Inf),
labels = c("OldApp", "NewApp"))
#
# #For Two Sample t-test, check if Variances are equal
bb_var <- var.test(Usage ~ Year, data = bb)
bb_var
##
## F test to compare two variances
##
## data: Usage by Year
## F = 0.16799, num df = 52, denom df = 69, p-value = 4.775e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1014182 0.2835174
## sample estimates:
## ratio of variances
## 0.1679867
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- ifelse(bb_var$p.value > 0.05, TRUE, FALSE)
#
# #Because 1 is "OldApp", 2 is "NewApp", we need to perform Lower Tail Test
bb_t <- t.test(formula = Usage ~ Year, data = bb, alternative = "less", var.equal = isVarEqual)
bb_t
##
## Welch Two Sample t-test
##
## data: Usage by Year
## t = -6.0938, df = 96.698, p-value = 1.124e-08
## alternative hypothesis: true difference in means between group OldApp and group NewApp is less than 0
## 95 percent confidence interval:
## -Inf -317.4977
## sample estimates:
## mean in group OldApp mean in group NewApp
## 334.1132 770.5571
# #Rolling Sums
old_sd <- sd(bb[bb$Year == "OldApp", ]$Usage)
old_n <- summary(bb$Year)[1]
#
ii <- bb %>% filter(Year == "NewApp", Dates <= '2017-05-01') %>% select(-Year)
#
jj <- ii %>% mutate(ID = row_number(), cSUM = cumsum(Usage), cMean = cSUM/ID) %>%
mutate(SD = across(Usage, ~ rollapply(., ID, sd, fill = NA, align = "right"))) %>%
mutate(DOF = floor({SD^2 / ID + old_sd^2 / old_n }^2 /
{{SD^2 / ID}^2/{ID-1} + {old_sd^2 / old_n}^2/{old_n-1}})) %>%
mutate(Sigma = sqrt({SD^2 /ID + old_sd^2 /old_n}))
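cMean and the right-aligned expanding SD can be checked independently of rollapply(); the sketch below uses the first five Usage values shown in str(jj):

```python
from statistics import stdev

# First five Usage values from the str(jj) output
usage = [421, 387, 264, 788, 691]

# Cumulative mean, mirroring cMean = cSUM / ID
cmean = [sum(usage[: i + 1]) / (i + 1) for i in range(len(usage))]

# Expanding (right-aligned) sample SD; undefined for a single point, as in the NA above
csd = [None] + [stdev(usage[: i + 1]) for i in range(1, len(usage))]

print([round(m, 1) for m in cmean])  # [421.0, 404.0, 357.3, 465.0, 510.2]
```

These match the cMean (421, 404, 357, 465, 510) and SD (NA, 24, 82.6, 225.6, 220) columns reported below.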
str(jj)
## tibble [40 x 9] (S3: tbl_df/tbl/data.frame)
## $ Dates: Date[1:40], format: "2016-08-01" "2016-08-01" "2016-08-01" ...
## $ Week : chr [1:40] "Week1" "Week2" "Week3" "Week4" ...
## $ Usage: num [1:40] 421 387 264 788 691 256 261 377 295 749 ...
## $ ID : int [1:40] 1 2 3 4 5 6 7 8 9 10 ...
## $ cSUM : num [1:40] 421 808 1072 1860 2551 ...
## $ cMean: num [1:40] 421 404 357 465 510 ...
## $ SD : tibble [40 x 1] (S3: tbl_df/tbl/data.frame)
## ..$ Usage: num [1:40] NA 24 82.6 225.6 220 ...
## $ DOF :'data.frame': 40 obs. of 1 variable:
## ..$ Usage: num [1:40] NA 14 3 3 4 6 7 9 11 13 ...
## $ Sigma:'data.frame': 40 obs. of 1 variable:
## ..$ Usage: num [1:40] NA 34.9 56.6 116.9 103 ...
If a disease is likely to spread in a particular weather condition (data given in the disease index sheet), then access of that disease should be higher in the months having suitable weather conditions. Help the analyst come up with a statistical test to support the claim for the two districts for which the sample of weather and disease access data is provided in the data sheet. Identify the diseases for which you can support this claim. Test this claim both for temperature and relative humidity at 95% confidence.
18.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
| Formula | p (Var) | Equal Var | p (t-test) | Upper H0 |
|---|---|---|---|---|
| D1 ~ iD1 | 0.00002 | FALSE | 0.00084 | Rejected |
| D2 ~ iD2 | 0.00009 | FALSE | 0.00076 | Rejected |
| D3 ~ iD3 | 0.0002 | FALSE | 0.00399 | Rejected |
| D4 ~ iD4 | 0.00029 | FALSE | 0.06659 | Not Rejected |
| D5 ~ iD5 | 0.37691 | TRUE | 0.01814 | Rejected |
| D7 ~ iD7 | 0.90392 | TRUE | 0.00239 | Rejected |
Figure 7.3 JAT: Disease Searches grouped by Favourable and Unfavourable Conditions (T and RH)
# #Merge both dataframes of Two districts
aa <- bind_rows(Belagavi = xxJbela, Dharwad = xxJdhar, .id = 'source') %>%
rename(Dates = Months, RH = "Relative Humidity", TMP = "Temperature") %>%
mutate(across(Dates, as_date)) %>% mutate(source = factor(source)) %>%
select(-c(10:13)) %>% select(-D6)
#
# #Based on Conditional T & RH, get each disease favourable condition = TRUE
q6_bb <- aa %>% mutate(iD1 = ifelse(TMP <= 24 & TMP >= 20 & RH > 80, TRUE, FALSE),
iD2 = ifelse(TMP <= 24.5 & TMP >= 21.5 & RH > 83, TRUE, FALSE),
iD3 = ifelse(TMP <= 24 & TMP >= 22, TRUE, FALSE),
iD4 = ifelse(TMP <= 26 & TMP >= 22 & RH > 85, TRUE, FALSE),
iD5 = ifelse(TMP <= 24.5 & TMP >= 22 & RH <= 85 & RH >= 77, TRUE, FALSE),
iD7 = ifelse(TMP > 25 & RH > 80, TRUE, FALSE)) %>%
mutate(across(starts_with("i"), factor, levels = c(TRUE, FALSE)))
bb <- q6_bb
#
# #Create all Formulae for variance and t-test
formulas <- paste0(names(bb)[3:8], " ~ ", names(bb)[11:16])
#
# #Apply formulae
output <- t(sapply(formulas, function(f) {
test_var <- var.test(as.formula(f), data = bb)
isVarEqual <- ifelse(test_var$p.value > 0.05, TRUE, FALSE)
test_t <- t.test(as.formula(f), data = bb, alternative = "greater", var.equal = isVarEqual)
c("p- Var" = format(round(test_var$p.value, 5), scientific = FALSE),
"Equal Var" = ifelse(test_var$p.value > 0.05, TRUE, FALSE),
"p- t-test" = format(round(test_t$p.value, 5), scientific = FALSE),
"Upper H0" = ifelse(test_t$p.value > 0.05, "Not Rejected", "Rejected"))
}))
bb <- q6_bb
hh <- q6_bb %>%
rename_with(~gsub("iD", "i", .x)) %>%
select(starts_with(c("D", "i"))) %>%
select(-Dates) %>%
pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.)(.)") %>%
rename(Values = "D", Favourable = "i")
#
title_hh <- "BoxPlot of Searches for Diseases in both districts"
caption_hh <- "B15P03"
#
B15 <- hh %>% { ggplot(data = ., mapping = aes(x = Disease, y = Values, fill = Favourable)) +
geom_boxplot(outlier.shape = NA) +
#stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") +
#scale_y_continuous(breaks = seq(0, 110, 10), limits = c(0, 110)) +
geom_point(position = position_jitterdodge(jitter.width = 0.1),
size = 1, alpha = 0.7, colour = "#21908CFF") +
k_gglayer_box +
theme(
#legend.justification = c("right", "top"),
#legend.box.just = "right",
#legend.margin = margin(6, 6, 6, 6),
legend.position = c(.90, .95)
) +
labs(x = "Diseases", y = "Searches per month", fill = "Favourable",
caption = caption_hh, title = title_hh)
}
assign(caption_hh, B15)
rm(B15)
# #rename_with() uses a formula
# #Selection Helpers like starts_with() can take multiple conditions
# #pivot_longer() can return multiple groups
# #Pattern Match: Both First and Second Pattern can contain only 1 character
ii <- q6_bb %>%
rename_with(~gsub("iD", "i", .x)) %>%
select(starts_with(c("D", "i"))) %>% select(-Dates) %>%
pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.)(.)") %>%
rename(Values = "D", Favourable = "i")
#
# #First Pattern can contain 1 or more characters but the Second can have only 1 character
jj <- q6_bb %>%
select(starts_with(c("D", "i"))) %>% select(-Dates) %>%
pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.*)(.)") %>%
rename(Values = "D", Favourable = iD)
stopifnot(identical(ii, jj))
if(FALSE){# #WARNING: Installation may take some time.
install.packages("mice", dependencies = TRUE)
install.packages("car", dependencies = TRUE)
}
Please import the "B16-Cars2.csv".
| sl.No. | mpg | cylinders | cubicinches | hp | weightlbs | time.to.60 | year | brand |
|---|---|---|---|---|---|---|---|---|
| 1 | 14.0 | 8 | 350 | 165 | 4209 | 12 | 1972 | US |
| 2 | 31.9 | 4 | 89 | 71 | 1925 | 14 | 1980 | Europe |
| 3 | 17.0 | 8 | 302 | 140 | 3449 | 11 | 1971 | US |
| 4 | 15.0 | 8 | 400 | 150 | 3761 | 10 | 1971 | US |
| 5 | 30.5 | 4 | 98 | 63 | 2051 | 17 | 1978 | US |
| 6 | 23.0 | 8 | 350 | 125 | 3900 | 17 | 1980 | US |
# #Structure
str(xxCars)
## spec_tbl_df [263 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ sl.No. : num [1:263] 1 2 3 4 5 6 7 8 9 10 ...
## $ mpg : num [1:263] 14 31.9 17 15 30.5 23 13 14 25.4 37.7 ...
## $ cylinders : num [1:263] 8 4 8 8 4 8 8 8 5 4 ...
## $ cubicinches: num [1:263] 350 89 302 400 98 350 351 440 183 89 ...
## $ hp : num [1:263] 165 71 140 150 63 125 158 215 77 62 ...
## $ weightlbs : num [1:263] 4209 1925 3449 3761 2051 ...
## $ time.to.60 : num [1:263] 12 14 11 10 17 17 13 9 20 17 ...
## $ year : num [1:263] 1972 1980 1971 1971 1978 ...
## $ brand : chr [1:263] "US" "Europe" "US" "US" ...
# #Summary
summary(xxCars)
## sl.No. mpg cylinders cubicinches hp weightlbs
## Min. : 1.0 Min. : 10.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. : 192.5
## 1st Qu.: 66.5 1st Qu.: 16.95 1st Qu.:4.000 1st Qu.:103.0 1st Qu.: 75.5 1st Qu.:2245.5
## Median :132.0 Median : 22.00 Median :6.000 Median :156.0 Median : 95.0 Median :2830.0
## Mean :132.0 Mean : 25.07 Mean :5.593 Mean :201.5 Mean :106.3 Mean :2992.9
## 3rd Qu.:197.5 3rd Qu.: 28.90 3rd Qu.:8.000 3rd Qu.:302.0 3rd Qu.:137.5 3rd Qu.:3654.5
## Max. :263.0 Max. :527.00 Max. :8.000 Max. :455.0 Max. :230.0 Max. :4997.0
## time.to.60 year brand
## Min. : 8.00 Min. :1971 Length:263
## 1st Qu.:14.00 1st Qu.:1974 Class :character
## Median :16.00 Median :1977 Mode :character
## Mean :15.54 Mean :1977
## 3rd Qu.:17.00 3rd Qu.:1980
## Max. :25.00 Max. :1983
aa <- xxCars #No missing value
bb <- aa #Will have missing value later
#
# #Identify the Number of Missing Values
if(anyNA(bb)) {
cat(paste0("NA are Present! Total NA = ", sum(is.na(bb)), "\n"))
} else cat(paste0("NA not found.\n"))
## NA not found.
#
# #Record Some Values, before deleting them
bb_22 <- bb$mpg[2] #bb[2, 2] #31.9
bb_39 <- bb$brand[3] #bb[3, 9] #"US"
bb_43 <- bb$cylinders[4] #bb[4, 3] #8
#
# #Delete
bb$mpg[2] <- bb$brand[3] <- bb$cylinders[4] <- NA
#
# #Identify the Number of Missing Values
cat(paste0("NA are Present! Total NA = ", sum(is.na(bb)), "\n"))
## NA are Present! Total NA = 3
#
# #Which Columns have NA
#summary(bb)
bb_na_col <- colSums(is.na(bb))
#
# #Column Names with their Column Index
which(bb_na_col != 0)
## mpg cylinders brand
## 2 3 9
#
# #Number of NA in each Column
bb_na_col[which(bb_na_col != 0)]
## mpg cylinders brand
## 1 1 1
#
# #How many rows contain NA
sum(!complete.cases(bb))
## [1] 3
#
# #Indices of Rows with NA
head(which(!complete.cases(bb)))
## [1] 2 3 4
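The colSums() idiom above has a row-wise counterpart for counting NA per observation; a minimal sketch on a hypothetical toy data frame:

```r
# #Toy data frame (hypothetical values, for illustration only)
tt <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
# #Number of NA in each Row
unname(rowSums(is.na(tt)))
## [1] 1 2 0
```

Indices of the offending rows then follow via which(), exactly as with complete.cases() above.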
ii <- bb
summary(ii$brand)
## Length Class Mode
## 263 character character
summary(factor(ii$brand))
## Europe Japan US NA's
## 48 51 163 1
#
# #table() by default does not show NA, even for factors. However, it has a 'useNA' option
table(ii$brand)
##
## Europe Japan US
## 48 51 163
table(factor(ii$brand))
##
## Europe Japan US
## 48 51 163
table(ii$brand, useNA = "always")
##
## Europe Japan US <NA>
## 48 51 163 1
ii <- bb
# #Mean Replacement
#ii %>% mutate(across(mpg, ~ replace(., is.na(.), round(mean(mpg, na.rm = TRUE), 2))))
ii$mpg[which(is.na(ii$mpg))] <- round(mean(ii$mpg, na.rm = TRUE), digits = 2)
jj <- ii
#
# #Median Replacement
ii <- bb
#ii %>% mutate(across(mpg, replace_na, round(median(mpg, na.rm = TRUE), 2)))
ii$mpg[which(is.na(ii$mpg))] <- round(median(ii$mpg, na.rm = TRUE), digits = 2)
table(bb$brand, useNA = "always")
##
## Europe Japan US <NA>
## 48 51 163 1
bb %>% group_by(brand) %>% summarise(n())
## # A tibble: 4 x 2
## brand `n()`
## <chr> <int>
## 1 Europe 48
## 2 Japan 51
## 3 US 163
## 4 <NA> 1
#
# #Mode Replacement
#ii$brand[which(is.na(ii$brand))] <- f_getMode(ii$brand)
ii$brand[which(is.na(ii$brand))] <- "US"
#
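f_getMode() referenced above is a user-defined helper, not a base R function; a minimal sketch (NA values are dropped by table(), and ties resolve to the first value reaching the maximum count):

```r
# #Mode of a vector: most frequent non-NA value
f_getMode <- function(x) {
  tt <- table(x)              #Counts per distinct value, NA excluded
  names(tt)[which.max(tt)]    #First value with the maximum count
}
f_getMode(c("a", "z", "c", "b", "b", "b", "USA", NA, "a"))
## [1] "b"
```

Note the sketch returns a character string; wrap the result with as.numeric() when imputing a numeric column.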
# #Caution: Do not use max() on "character" for mode replacement
# #It will only compare the ASCII values of the letters
ii <- c("a", "z", "c", "b", "b", "b", "USA", NA, "a")
max(ii, na.rm = TRUE) #Wrong Value
## [1] "z"
f_getMode(ii)
## [1] "b"
# #Convert to Factor before using MICE
bb$brand <- factor(bb$brand)
#
# #mice::md.pattern()
na_bb <- md.pattern(bb, plot = FALSE)
na_bb
## sl.No. cubicinches hp weightlbs time.to.60 year mpg cylinders brand
## 260 1 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 1 0 1 1 1
## 0 0 0 0 0 0 1 1 1 3
Figure 8.1 Cars: Inserted Missing Value Pattern by md.pattern()
# #Choose Two Numbers from 1:10, Randomly
sample(1:10, 2)
## [1] 7 2
sample(1:10, 2)
## [1] 7 4
sample(1:10, 2)
## [1] 3 10
# #All above calls to generate Two random numbers produce different outcomes
# #Using set.seed() we can regenerate the same random numbers every time
set.seed(3)
sample(1:10, 2)
## [1] 5 7
sample(1:10, 2)
## [1] 4 8
#
set.seed(3)
sample(1:10, 2)
## [1] 5 7
sample(1:10, 2)
## [1] 4 8
# #Convert to Factor before using MICE
bb$brand <- factor(bb$brand)
# #mice() for imputation
# #Including all relevant data i.e. skipping Serial Number only
impute <- mice(bb[, 2:9], m = 2, seed = 3)
##
## iter imp variable
## 1 1 mpg cylinders brand
## 1 2 mpg cylinders brand
## 2 1 mpg cylinders brand
## 2 2 mpg cylinders brand
## 3 1 mpg cylinders brand
## 3 2 mpg cylinders brand
## 4 1 mpg cylinders brand
## 4 2 mpg cylinders brand
## 5 1 mpg cylinders brand
## 5 2 mpg cylinders brand
#
print(impute)
## Class: mids
## Number of multiple imputations: 2
## Imputation methods:
## mpg cylinders cubicinches hp weightlbs time.to.60 year brand
## "pmm" "pmm" "" "" "" "" "" "polyreg"
## PredictorMatrix:
## mpg cylinders cubicinches hp weightlbs time.to.60 year brand
## mpg 0 1 1 1 1 1 1 1
## cylinders 1 0 1 1 1 1 1 1
## cubicinches 1 1 0 1 1 1 1 1
## hp 1 1 1 0 1 1 1 1
## weightlbs 1 1 1 1 0 1 1 1
## time.to.60 1 1 1 1 1 0 1 1
#
# #Each of the m imputations produces a different set of imputed values
# #e.g. for 'mpg' the two sets contain
impute$imp$mpg
## 1 2
## 2 28 14
#
# #NOTE: Original Values that were removed earlier
tibble(mpg = bb_22, brand = bb_39, cylinders = bb_43)
## # A tibble: 1 x 3
## mpg brand cylinders
## <dbl> <chr> <dbl>
## 1 31.9 US 8
#
# #Complete First Set
set1_bb <- complete(impute, 1)
tibble(mpg = set1_bb$mpg[2], brand = set1_bb$brand[3], cylinders = set1_bb$cylinders[4])
## # A tibble: 1 x 3
## mpg brand cylinders
## <dbl> <fct> <dbl>
## 1 28 US 8
#
# #Complete Second Set
set2_bb <- complete(impute, 2)
tibble(mpg = set2_bb$mpg[2], brand = set2_bb$brand[3], cylinders = set2_bb$cylinders[4])
## # A tibble: 1 x 3
## mpg brand cylinders
## <dbl> <fct> <dbl>
## 1 14 US 8
set.seed(3)
ii <- mice(bb[, 2:9], m = 3)
##
## iter imp variable
## 1 1 mpg cylinders brand
## 1 2 mpg cylinders brand
## 1 3 mpg cylinders brand
## 2 1 mpg cylinders brand
## 2 2 mpg cylinders brand
## 2 3 mpg cylinders brand
## 3 1 mpg cylinders brand
## 3 2 mpg cylinders brand
## 3 3 mpg cylinders brand
## 4 1 mpg cylinders brand
## 4 2 mpg cylinders brand
## 4 3 mpg cylinders brand
## 5 1 mpg cylinders brand
## 5 2 mpg cylinders brand
## 5 3 mpg cylinders brand
set.seed(3)
jj <- mice(bb[, 2:9], m = 3)
##
## iter imp variable
## 1 1 mpg cylinders brand
## 1 2 mpg cylinders brand
## 1 3 mpg cylinders brand
## 2 1 mpg cylinders brand
## 2 2 mpg cylinders brand
## 2 3 mpg cylinders brand
## 3 1 mpg cylinders brand
## 3 2 mpg cylinders brand
## 3 3 mpg cylinders brand
## 4 1 mpg cylinders brand
## 4 2 mpg cylinders brand
## 4 3 mpg cylinders brand
## 5 1 mpg cylinders brand
## 5 2 mpg cylinders brand
## 5 3 mpg cylinders brand
#
# #identical() is FALSE but all.equal() is TRUE
identical(ii, jj)
## [1] FALSE
all.equal(ii, jj)
## [1] TRUE
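This identical()-vs-all.equal() behaviour is general in R: all.equal() compares within a numerical tolerance, while identical() demands exact equality of values and attributes. A minimal, unrelated illustration:

```r
a <- 0.1 + 0.2
b <- 0.3
identical(a, b) #Exact, bit-level comparison
## [1] FALSE
all.equal(a, b) #Equal within tolerance (~1.5e-8 by default)
## [1] TRUE
```

For the two mids objects above the imputed values agree, but presumably some bookkeeping components (e.g. the recorded call) differ, so identical() returns FALSE.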
#
# #Similarly
ii <- mice(bb[, 2:9], m = 3, seed = 3)
##
## iter imp variable
## 1 1 mpg cylinders brand
## 1 2 mpg cylinders brand
## 1 3 mpg cylinders brand
## 2 1 mpg cylinders brand
## 2 2 mpg cylinders brand
## 2 3 mpg cylinders brand
## 3 1 mpg cylinders brand
## 3 2 mpg cylinders brand
## 3 3 mpg cylinders brand
## 4 1 mpg cylinders brand
## 4 2 mpg cylinders brand
## 4 3 mpg cylinders brand
## 5 1 mpg cylinders brand
## 5 2 mpg cylinders brand
## 5 3 mpg cylinders brand
jj <- mice(bb[, 2:9], m = 3, seed = 3)
##
## iter imp variable
## 1 1 mpg cylinders brand
## 1 2 mpg cylinders brand
## 1 3 mpg cylinders brand
## 2 1 mpg cylinders brand
## 2 2 mpg cylinders brand
## 2 3 mpg cylinders brand
## 3 1 mpg cylinders brand
## 3 2 mpg cylinders brand
## 3 3 mpg cylinders brand
## 4 1 mpg cylinders brand
## 4 2 mpg cylinders brand
## 4 3 mpg cylinders brand
## 5 1 mpg cylinders brand
## 5 2 mpg cylinders brand
## 5 3 mpg cylinders brand
identical(ii, jj)
## [1] FALSE
all.equal(ii, jj)
## [1] TRUE
(External) MICE Package Author
# #Using the "character" to generate the Warning
ii <- aa
ii$mpg[2] <- ii$brand[3] <- ii$cylinders[4] <- NA
#
tryCatch(expr = {
jj <- mice(ii[, 2:9], m = 1, seed = 3)
}, warning = function(w) {
print(paste0(w))
})
##
## iter imp variable
## 1 1 mpg cylinders
## 2 1 mpg cylinders
## 3 1 mpg cylinders
## 4 1 mpg cylinders
## 5 1 mpg cylinders
## [1] "simpleWarning: Number of logged events: 1\n"
#
# #Warning message: Number of logged events
# #It can occur because of a variety of issues in the data
jj$loggedEvents
## NULL
Refer Outliers: C03 and Outliers: B12
Figure 8.2 Cars: Histogram and Density of Weight (lbs)
# Set up the plot area to visualise 3 plots side by side
par(mfrow = c(1, 3))
# Create the histogram bars
hist(aa$weightlbs,
breaks = 30,
xlim = c(0, 5000),
col = "blue",
border = "black",
ylim = c(0, 40),
xlab = "Weight",
ylab = "Counts",
main = "Histogram of Car Weights")
# Make a box around the plot
box(which = "plot",
lty = "solid",
col = "black")
# #Histogram
#bb <- na.omit(xxflights$air_time)
hh <- tibble(ee = aa$weightlbs)
title_hh <- "Cars: Histogram of Weight"
caption_hh <- "B16P02"
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
geom_text(data = tibble(x = mean_hh, y = -Inf,
label = paste0("Mean= ", mean_hh)),
aes(x = x, y = y, label = label),
color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
geom_text(data = tibble(x = median_hh, y = -Inf,
label = paste0("Median= ", median_hh)),
aes(x = x, y = y, label = label),
color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh,
"; Median= ", median_hh, "; SD= ", sd_hh,
")"),
caption = caption_hh, title = title_hh)
}
assign(caption_hh, B16)
rm(B16)
# #Density Curve
title_hh <- "Cars: Density Plot of Weight"
caption_hh <- "B16P03"
# #Get Quantiles and Ranges of mean +/- sigma
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
# #Arrow
arr_y <- 0.0005 #mean(density_hh_tbl$y) #
arr_lst <- list(list("99.7%", xpoints_hh[1], xpoints_hh[7], arr_y),
list("95.4%", xpoints_hh[2], xpoints_hh[6], arr_y),
list("68.3%", xpoints_hh[3], xpoints_hh[5], arr_y))
arr_hh <- arr_lst[[1]]
#
# # Latex Labels
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'),
TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'),
TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_density(alpha = 0.2, colour = "#21908CFF") +
geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') +
geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') +
#scale_y_continuous(limits = c(0, 0.009), breaks = seq(0, 0.009, 0.003)) +
scale_y_continuous(labels = function(n){format(n, scientific = FALSE)}) +
scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) +
annotate("segment", x = xpoints_hh[4] - 0.5 * sd_hh, xend = arr_hh[[2]], y = arr_hh[[4]],
yend = arr_hh[[4]], arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
annotate("segment", x = xpoints_hh[4] + 0.5 * sd_hh, xend = arr_hh[[3]], y = arr_hh[[4]],
yend = arr_hh[[4]], arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
annotate(geom = "text", x = xpoints_hh[4], y = arr_hh[[4]], label = arr_hh[[1]]) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Density",
subtitle = paste0("(N=", nrow(.), "; ", "Mean= ", round(mean(.[[1]]), 1),
"; Median= ", round(median(.[[1]]), 1), "; SD= ", round(sd(.[[1]]), 1),
")"),
caption = caption_hh, title = title_hh)
}
assign(caption_hh, B16)
rm(B16)
Figure 8.3 Cars: Scatterplot of Weight (x) vs MPG (y) with and without the two outliers
# Create a Scatterplot
plot(aa$weightlbs,
aa$mpg,
xlim = c(0, 5000),
ylim = c(0, 600),
xlab = "Weight",
ylab = "MPG",
main = "Scatterplot of MPG by Weight",
type = "p", #Points
pch = 16,
col = "blue")
# Add open black circles
points(aa$weightlbs,
aa$mpg,
type = "p",
col = "black")
# #IN: hh$x, hh$y, title_hh, caption_hh, x_hh, y_hh
# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
B16 <- hh %>% { ggplot(data = ., aes(x = x, y = y)) +
geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")),
formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
eq.x.rhs = "~italic(x)", parse = TRUE) +
geom_vline(aes(xintercept = round(mean(x), 3)), color = '#440154FF', linetype = "dashed") +
geom_hline(aes(yintercept = round(mean(y), 3)), color = '#440154FF', linetype = "dashed") +
geom_text(data = tibble(x = mean(.[["x"]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character")),
aes(x = x, y = y, label = label),
size = 4, color = '#440154FF', hjust = 1.5, vjust = -1, parse = TRUE ) +
geom_text(data = tibble(x = 0, y = mean(.[["y"]]),
label = TeX(r'($\bar{y}$)', output = "character")),
aes(x = x, y = y, label = label),
size = 4, color = '#440154FF', hjust = 1.5, vjust = 1.5, parse = TRUE ) +
geom_point() +
k_gglayer_scatter +
labs(x = x_hh, y = y_hh,
#subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)"),
caption = caption_hh, title = title_hh)
}
assign(caption_hh, B16)
rm(B16)
Figure 8.4 Cars: BoxPlot of MPG (excluding 1 point) vs. Cylinders (4, 6, 8)
boxplot(mpg ~ cylinders, data = aa, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")
# #BoxPlot
hh <- aa %>% select(mpg, cylinders) %>% filter(!cylinders %in% c(3, 5)) %>%
filter(mpg < max(mpg)) %>% mutate(across(cylinders, factor))
#
title_hh <- "BoxPlot of MPG (excluding 1 point) vs. Cylinders (4, 6, 8)"
caption_hh <- "B16P06"
x_hh <- "Cylinders"
y_hh <- "MPG"
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = cylinders, y = mpg, fill = cylinders)) +
geom_boxplot(outlier.shape = NA) +
geom_point(position = position_jitterdodge(jitter.width = 0.1), size = 1, alpha = 0.7) +
k_gglayer_box +
theme(legend.position = 'none') +
labs(x = x_hh, y = y_hh, caption = caption_hh, title = title_hh)
}
assign(caption_hh, B16)
rm(B16)
bb <- aa
dim(bb)
## [1] 263 9
#
# #summary() or quantile()
summary(bb$weightlbs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 192.5 2245.5 2830.0 2992.9 3654.5 4997.0
q_bb <- quantile(bb$weightlbs, probs = c(.25, .75), na.rm = TRUE)
q_bb
## 25% 75%
## 2245.5 3654.5
#
iqr_bb <- IQR(bb$weightlbs)
iqr_bb
## [1] 1409
#
upp_bb <- q_bb[2] + 1.5 * iqr_bb
low_bb <- q_bb[1] - 1.5 * iqr_bb
#
#bb %>% filter(weightlbs >= {summary(weightlbs)[2] - 1.5 * IQR(weightlbs)} &
# weightlbs <= {summary(weightlbs)[5] + 1.5 * IQR(weightlbs)})
kept_bb <- subset(bb, bb$weightlbs >= low_bb & bb$weightlbs <= upp_bb)
if(nrow(bb) == nrow(kept_bb)) {
cat(paste0("No Point was removed because none was outside the range.\n"))
} else cat(paste0("Number of Points removed = ", nrow(bb) - nrow(kept_bb),"\n"))
## No Point was removed because none was outside the range.
bb <- aa %>% filter(cylinders == 6)
dim(bb)
## [1] 57 9
#
# #summary() or quantile()
summary(bb$weightlbs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 192.5 2910.0 3121.0 3120.4 3415.0 3907.0
q_bb <- quantile(bb$weightlbs, probs = c(.25, .75), na.rm = TRUE)
q_bb
## 25% 75%
## 2910 3415
#
iqr_bb <- IQR(bb$weightlbs)
iqr_bb
## [1] 505
#
upp_bb <- q_bb[2] + 1.5 * iqr_bb
low_bb <- q_bb[1] - 1.5 * iqr_bb
#
#bb %>% filter(weightlbs >= {summary(weightlbs)[2] - 1.5 * IQR(weightlbs)} &
# weightlbs <= {summary(weightlbs)[5] + 1.5 * IQR(weightlbs)})
kept_bb <- subset(bb, bb$weightlbs >= low_bb & bb$weightlbs <= upp_bb)
if(nrow(bb) == nrow(kept_bb)) {
cat(paste0("No Point was removed because none was outside the range.\n"))
} else cat(paste0("Number of Points removed = ", nrow(bb) - nrow(kept_bb),"\n"))
## Number of Points removed = 1
Definitions and Exercises are from the Book (David R. Anderson 2018)
Hence, the total number of data items can be determined by multiplying the number of observations by the number of variables.
For Example, Gender as Male and Female. In cases where the scale of measurement is nominal, a numerical code as well as a nonnumerical label may be used. For example, 1 denotes Male, 2 denotes Female. The scale of measurement is nominal even though the data appear as numerical values. Only Mode can be calculated.
For example, Size as small, medium, large. Along with the labels, similar to nominal data, the data can also be ranked or ordered, which makes the measurement scale ordinal. Ordinal data can also be recorded by a numerical code. Median can be calculated but not the Mean.
Interval data are always numerical. These can be ranked or ordered like ordinal. In addition, the differences between them are meaningful.
Variables such as distance, height, weight, and time use the ratio scale of measurement. This scale requires that a zero value be included to indicate that nothing exists for the variable at the zero point. Mean can be calculated.
For example, consider the cost of an automobile. A zero value for the cost would indicate that the automobile has no cost and is free. In addition, if we compare the cost of 30,000 dollars for one automobile to the cost of 15,000 dollars for a second automobile, the ratio property shows that the first automobile is 30000/15000 = 2 times, or twice, the cost of the second automobile.
See Table 9.1 for more details.
Interval scale is a measure of continuous quantitative data that has an arbitrary 0 reference point. This contrasts with ratio-scaled data, which have a non-arbitrary 0 reference point. Ex: When we look at “profit”, negative profit does make sense to us. So while the 0 for “profit” is a meaningful reference point (just like 0 °C in temperature measurements), it is arbitrary. Therefore, profit uses the interval scale of measurement.
In an interval scale, you can take difference of two values. You may not be able to take ratios of two values. Ex: Temperature in Celsius. You can say that if temperatures of two places are 40 °C and 20 °C, then one is hotter than the other (taking difference). But you cannot say that first is twice as hot as the second (not allowed to take ratio).
In a ratio scale, you can take a ratio of two values. Ex: 40 kg is twice as heavy as 20 kg (taking ratios).
Also, “0” on ratio scale means the absence of that physical quantity. “0” on interval scale does not mean the same. 0 kg means the absence of weight. 0 °C does not mean absence of heat.
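The temperature example can be made concrete in R: ratios become meaningful only after converting to a scale with an absolute zero, such as Kelvin (the values here are illustrative):

```r
c1 <- 40; c2 <- 20 #Temperatures in Celsius (interval scale)
c1 / c2 #2, but "twice as hot" is NOT meaningful on an interval scale
## [1] 2
k1 <- c1 + 273.15; k2 <- c2 + 273.15 #Convert to Kelvin (ratio scale)
round(k1 / k2, 3) #The physically meaningful ratio
## [1] 1.068
```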
| Features | Interval scale | Ratio scale |
|---|---|---|
| Variable property | Addition and subtraction | Multiplication and Division i.e. calculate ratios. Thus, you can leverage numbers on the scale against 0. |
| Absolute Point Zero | Zero-point in an interval scale is arbitrary. For example, the temperature can be below 0 °C and into negative temperatures. | The ratio scale has an absolute zero or character of origin. Height and weight cannot be zero or below zero. |
| Calculation | Statistically, in an interval scale, the Arithmetic Mean is calculated. Statistical dispersion permits range and standard deviation. The coefficient of variation is not permitted. | Statistically, in a ratio scale, the Geometric or Harmonic mean is calculated. Also, range and coefficient of variation are permitted for measuring statistical dispersion. |
| Measurement | Interval scale can measure size and magnitude as multiple factors of a defined unit. | Ratio scale can measure size and magnitude as a factor of one defined unit in terms of another. |
| Example | Temperature in Celsius, Calendar years and time, Profit | These possess an absolute zero characteristic, like age, weight, height, or Sales |
If the variable is categorical, the statistical analysis is limited. We can summarize categorical data by counting the number of observations in each category or by computing the proportion of the observations in each category. However, even when the categorical data are identified by a numerical code, arithmetic operations do not provide meaningful results.
Arithmetic operations provide meaningful results for quantitative variables. For example, quantitative data may be added and then divided by the number of observations to compute the average value.
Quantitative data may be discrete or continuous.
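A minimal sketch of both cases on hypothetical vectors — counts and proportions for categorical data, arithmetic for quantitative data:

```r
# #Categorical (nominal): only counting and proportions are meaningful
gender <- c("Male", "Female", "Female", "Male", "Female")
prop.table(table(gender))
## gender
## Female   Male
##    0.6    0.4
# #Quantitative: arithmetic operations are meaningful
mean(c(14.0, 31.9, 17.0))
## [1] 20.96667
```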
As a result, the data obtained from a well-designed experiment can often provide more information than data obtained from existing sources or from an observational study.
Refer Sample For More …
The population is the set of entities under study.
Instead, we could take a subset of this population called a sample and use this sample to draw inferences about the population under study, given some conditions.
Whenever statisticians use a sample to estimate a population characteristic of interest, they usually provide a statement of the quality, or precision, associated with the estimate.
Inferential statistics are used for Hypothesis Testing.
The two most common types of Statistical Inference are -
Reasoning for Tests of Significance
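Such a precision statement usually takes the form of a confidence interval; a minimal sketch on simulated data (the sample values are hypothetical):

```r
set.seed(3)
# #Sample of n = 100 from a population with true mean 50
x <- rnorm(100, mean = 50, sd = 10)
# #95% confidence interval for the population mean
t.test(x)$conf.int
```

The interval width — roughly plus or minus two standard errors — quantifies the precision of the sample-based estimate.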
Analytics is used for data-driven or fact-based decision making, which is often seen as more objective than alternative approaches to decision making. The tools of analytics can aid decision making by creating insights from data, improving our ability to more accurately forecast for planning, helping us quantify risk, and yielding better alternatives through analysis.
Analytics is now generally thought to comprise three broad categories of techniques. These categories are descriptive analytics, predictive analytics, and prescriptive analytics.
Examples of these types of techniques are data queries, reports, descriptive statistics, data visualization, data dashboards, and basic what-if spreadsheet models.
Linear regression, time series analysis, and forecasting models fall into the category of predictive analytics. Simulation, which is the use of probability and statistical computer models to better understand risk, also falls under the category of predictive analytics.
Prescriptive analytics differs greatly from descriptive or predictive analytics. What distinguishes prescriptive analytics is that prescriptive models yield a best course of action to take. That is, the output of a prescriptive model is a best decision.
Optimization models, which generate solutions that maximize or minimize some objective subject to a set of constraints, fall into the category of prescriptive models.
Volume refers to the amount of available data; velocity refers to the speed at which data is collected and processed; and variety refers to the different data types. The term data warehousing is used to refer to the process of capturing, storing, and maintaining the data.
Data mining relies heavily on statistical methodology such as multiple regression, logistic regression, and correlation.
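As a minimal sketch, correlation and regression are one-liners in base R, shown here on the built-in mtcars data:

```r
data(mtcars)
# #Pearson correlation between weight and mileage
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
# #Simple linear regression of mileage on weight
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit) #Intercept and slope
```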
cor(), lm() etc.
xxComputers <- f_getObject("xxComputers", "C01-Computers.csv", "971fb6096e4f71e8185d3327a9033a10")
xxCordless <- f_getObject("xxCordless", "C01-Cordless.csv", "9991f612fe44f1c890440bd238084679")
f_getObject <- function(x_name, x_source, x_md = "") {
# #Debugging
a07bug <- FALSE
# #Read the File or Object
# #Ex: xxCars <- f_getObject("xxCars", "S16-cars2.csv", "30051fb47f65810f33cb992015b849cc")
# #tools::md5sum("xx.csv") OR tools::md5sum(paste0(.z$XL, "xx", ".txt"))
#
# #Path to the File
loc_src <- paste0(.z$XL, x_source)
# #Path to the Object
loc_rds <- paste0(.z$XL, x_name, ".rds")
#
# #x_file[1] FILENAME & x_file[2] FILETYPE
x_file <- strsplit(x_source, "[.]")[[1]]
#
if(all(x_md == tools::md5sum(loc_src), file.exists(loc_rds),
file.info(loc_src)$mtime < file.info(loc_rds)$mtime)) {
# #Read RDS if (exists, newer than source, source not modified i.e. passes md5sum)
if(a07bug) print("A07 Flag 01: Reading from RDS")
return(readRDS(loc_rds))
} else if(!file.exists(loc_src)){
message("ERROR: File does not exist! : ", loc_src, "\n")
stop()
} else if(x_file[2] == "csv") {
# #Read CSV as a Tibble
# #col_double(), col_character(), col_logical(), col_integer()
# #DATETIME (EXCEL) "YYYY-MM-DD HH:MM:SS" imported as "UTC"
tbl <- read_csv(loc_src, show_col_types = FALSE)
# #Remove Unnecessary Attributes
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
# #Write Object as RDS
saveRDS(tbl, loc_rds)
# #Return Object
if(a07bug) print("A07 Flag 02: Reading from Source and Saving as RDS")
return(tbl)
} else if(x_file[2] == "xlsx") {
# #Read All Sheets of Excel in a list
tbl <- lapply(excel_sheets(loc_src), read_excel, path = loc_src)
# #Write Object as RDS
saveRDS(tbl, loc_rds)
# #Return Object
return(tbl)
} else {
message("f_getObject(): UNKNOWN")
stop()
}
}
bb <- xxComputers
#displ_names <- c("")
#stopifnot(identical(ncol(bb), length(displ_names)))
#
# #Kable Table
kbl(bb,
caption = "(C01T01)",
#col.names = displ_names,
escape = FALSE, align = "c", booktabs = TRUE
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
html_font = "Consolas", font_size = 12,
full_width = FALSE,
#position = "float_left",
fixed_thead = TRUE
) %>%
# #Header Row Dark & Bold: RGB (48, 48, 48) =HEX (#303030)
row_spec(0, color = "white", background = "#303030", bold = TRUE,
extra_css = "border-bottom: 1px solid; border-top: 1px solid"
)
bb <- tibble(Company = c("Hertz", "Dollar", "Avis"),
`2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
`2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
## # A tibble: 12 x 3
## Company Year Values
## <chr> <chr> <dbl>
## 1 Hertz 2007 327
## 2 Hertz 2008 311
## 3 Hertz 2009 286
## 4 Hertz 2010 290
## 5 Dollar 2007 167
## 6 Dollar 2008 140
## 7 Dollar 2009 106
## 8 Dollar 2010 108
## 9 Avis 2007 204
## 10 Avis 2008 220
## 11 Avis 2009 300
## 12 Avis 2010 270
# #Transpose
(ii <- bb %>%
pivot_longer(!Company, names_to = "Year", values_to = "Values") %>%
pivot_wider(names_from = Company, values_from = Values))
## # A tibble: 4 x 4
## Year Hertz Dollar Avis
## <chr> <dbl> <dbl> <dbl>
## 1 2007 327 167 204
## 2 2008 311 140 220
## 3 2009 286 106 300
## 4 2010 290 108 270
# #Equivalent
stopifnot(identical(ii,
bb %>% pivot_longer(-1) %>%
pivot_wider(names_from = 1, values_from = value) %>%
rename(., Year = name)))
| SN | tablet | cost | os | display_inch | battery_hh | cpu |
|---|---|---|---|---|---|---|
| 1 | Acer Iconia W510 | 599 | Windows | 10.1 | 8.5 | Intel |
| 2 | Amazon Kindle Fire HD | 299 | Android | 8.9 | 9.0 | TI OMAP |
| 3 | Apple iPad 4 | 499 | iOS | 9.7 | 11.0 | Apple |
| 4 | HP Envy X2 | 860 | Windows | 11.6 | 8.0 | Intel |
| 5 | Lenovo ThinkPad Tablet | 668 | Windows | 10.1 | 10.5 | Intel |
| 6 | Microsoft Surface Pro | 899 | Windows | 10.6 | 4.0 | Intel |
| 7 | Motorola Droid XYboard | 530 | Android | 10.1 | 9.0 | TI OMAP |
| 8 | Samsung Ativ Smart PC | 590 | Windows | 11.6 | 7.0 | Intel |
| 9 | Samsung Galaxy Tab | 525 | Android | 10.1 | 10.0 | Nvidia |
| 10 | Sony Tablet S | 360 | Android | 9.4 | 8.0 | Nvidia |
# #What is the average cost for the tablets #$582.90
cat(paste0("Avg. Cost for the tablets is = $", round(mean(bb$cost), digits = 1), "\n"))
## Avg. Cost for the tablets is = $582.9
#
# #Compare the average cost of tablets with different OS (Windows/Android) #$723.20 $428.5
(ii <- bb %>%
group_by(os) %>%
summarise(Mean = round(mean(cost), digits =1)) %>%
arrange(desc(Mean)) %>%
mutate(Mean = paste0("$", Mean)))
## # A tibble: 3 x 2
## os Mean
## <chr> <chr>
## 1 Windows $723.2
## 2 iOS $499
## 3 Android $428.5
#
cat(paste0("Avg. Cost of Tablets with Windows OS is = ",
ii %>% filter(os == "Windows") %>% select(Mean), "\n"))
## Avg. Cost of Tablets with Windows OS is = $723.2
# #What percentage of tablets use an Android operating system #40%
(ii <- bb %>%
group_by(os) %>%
summarise(PCT = n()) %>%
mutate(PCT = 100 * PCT / sum(PCT)) %>%
arrange(desc(PCT)) %>%
mutate(PCT = paste0(PCT, "%")))
## # A tibble: 3 x 2
## os PCT
## <chr> <chr>
## 1 Windows 50%
## 2 Android 40%
## 3 iOS 10%
#
cat(paste0("Android OS is used in ",
ii %>% filter(os == "Android") %>% select(PCT), " Tablets\n"))
## Android OS is used in 40% Tablets
| SN | brand | model | price | overall_score | voice_quality | handset_on_base | talk_time_hh |
|---|---|---|---|---|---|---|---|
| 1 | AT&T | CL84100 | 60 | 73 | Excellent | Yes | 7 |
| 2 | AT&T | TL92271 | 80 | 70 | Very Good | No | 7 |
| 3 | Panasonic | 4773B | 100 | 78 | Very Good | Yes | 13 |
| 4 | Panasonic | 6592T | 70 | 72 | Very Good | No | 13 |
| 5 | Uniden | D2997 | 45 | 70 | Very Good | No | 10 |
| 6 | Uniden | D1788 | 80 | 73 | Very Good | Yes | 7 |
| 7 | Vtech | DS6521 | 60 | 72 | Excellent | No | 7 |
| 8 | Vtech | CS6649 | 50 | 72 | Very Good | Yes | 7 |
# #What is the average price for the cordless telephones
cat(paste0("Avg. Price is = $", round(mean(bb$price), digits = 1), "\n"))
## Avg. Price is = $68.1
#
# #What is the average talk time for the cordless telephones
cat(paste0("Avg. Talk Time is = ", round(mean(bb$talk_time_hh), digits = 1), " Hours \n"))
## Avg. Talk Time is = 8.9 Hours
# #What percentage of the cordless telephones have a voice quality of excellent
(hh <- bb %>%
group_by(voice_quality) %>%
summarise(PCT = n()) %>%
mutate(PCT = 100 * PCT / sum(PCT)) %>%
mutate(voice_quality = factor(voice_quality,
levels = c("Very Good", "Excellent"), ordered = TRUE)) %>%
arrange(desc(voice_quality)) %>%
mutate(PCT = paste0(PCT, "%")))
## # A tibble: 2 x 2
## voice_quality PCT
## <ord> <chr>
## 1 Excellent 25%
## 2 Very Good 75%
#
cat(paste0("Percentage of 'Excellent' Voice Quality is = ",
hh %>% filter(voice_quality == "Excellent") %>% select(PCT), "\n"))
## Percentage of 'Excellent' Voice Quality is = 25%
#
# #Equivalent
print(bb %>%
group_by(voice_quality) %>%
summarise(PCT = n()) %>%
mutate(PCT = prop.table(PCT) * 100))
## # A tibble: 2 x 2
## voice_quality PCT
## <chr> <dbl>
## 1 Excellent 25
## 2 Very Good 75
# #What percentage of the cordless telephones have a handset on the base
bb %>%
group_by(handset_on_base) %>%
summarise(PCT = n()) %>%
mutate(PCT = 100 * PCT / sum(PCT)) %>%
arrange(desc(PCT)) %>%
mutate(PCT = paste0(PCT, "%")) %>%
filter(handset_on_base == "Yes")
## # A tibble: 1 x 2
## handset_on_base PCT
## <chr> <chr>
## 1 Yes 50%
bb <- tibble(Company = c("Hertz", "Dollar", "Avis"),
`2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
`2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
hh <- bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
# #Transpose
ii <- bb %>%
pivot_longer(!Company, names_to = "Year", values_to = "Values") %>%
pivot_wider(names_from = Company, values_from = Values)
# #Save an Image
ggsave(paste0(.z$PX, "C01P01_Cars_TimeSeries", ".png"), plot = C01P01)
# #Load an Image
knitr::include_graphics(paste0(.z$PX, "C01P01_Cars_TimeSeries", ".png"))
Figure 9.1 Multiple Time Series Graph
# #who appears to be the market share leader
# #how the market shares are changing over time
print(ii)
## # A tibble: 4 x 4
## Year Hertz Dollar Avis
## <chr> <dbl> <dbl> <dbl>
## 1 2007 327 167 204
## 2 2008 311 140 220
## 3 2009 286 106 300
## 4 2010 290 108 270
# #Row Total
jj <- ii %>% rowwise() %>% mutate(SUM = sum(c_across(where(is.numeric)))) %>% ungroup()
kk <- ii %>% mutate(SUM = rowSums(across(where(is.numeric))))
stopifnot(identical(jj, kk))
#
# #Rowwise Percentage Share
ii %>%
rowwise() %>%
mutate(SUM = sum(c_across(where(is.numeric)))) %>%
ungroup() %>%
mutate(across(2:4, ~ round(. * 100 / SUM, digits = 1), .names = "{.col}.{.fn}")) %>%
mutate(across(ends_with(".1"), ~ paste0(., "%")))
## # A tibble: 4 x 8
## Year Hertz Dollar Avis SUM Hertz.1 Dollar.1 Avis.1
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 2007 327 167 204 698 46.8% 23.9% 29.2%
## 2 2008 311 140 220 671 46.3% 20.9% 32.8%
## 3 2009 286 106 300 692 41.3% 15.3% 43.4%
## 4 2010 290 108 270 668 43.4% 16.2% 40.4%
# #Bar Plot
aa <- bb %>%
select(Company, `2010`) %>%
rename("Y2010" = `2010`) %>%
arrange(desc(.[2])) %>%
mutate(cSUM = cumsum(Y2010)) %>%
mutate(PCT = 100 * Y2010 / sum(Y2010)) %>%
mutate(cPCT = 100 * cumsum(Y2010) / sum(Y2010)) %>%
mutate(across(Company, factor, levels = unique(Company), ordered = TRUE))
# #
pareto_chr <- setNames(c(aa$Y2010), aa$Company)
stopifnot(identical(pareto_chr, aa %>% pull(Y2010, Company)))
stopifnot(identical(pareto_chr, aa %>% select(1:2) %>% deframe()))
# #Plot Pareto
C01P02 <- pareto.chart(pareto_chr,
xlab = "Company", ylab = "Cars",
# colors of the chart
#col=heat.colors(length(pareto_chr)),
# ranges of the percentages at the right
cumperc = seq(0, 100, by = 20),
# label y right
ylab2 = "Cumulative Percentage",
# title of the chart
main = "Pareto Chart"
)
Figure 9.2 Pareto of Cars in 2010
The relative frequency of a class equals the fraction or proportion of observations belonging to that class, i.e. it is expressed out of 1, whereas the ‘percent frequency’ is expressed out of 100%.
Rather than showing the frequency of each class, the cumulative frequency distribution shows the number of data items with values less than or equal to the upper class limit of each class.
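As a minimal numeric sketch of these definitions (using the soft-drink frequencies tabulated below), relative, percent, and cumulative frequencies differ only by scaling:

```r
# #Relative (out of 1), percent (out of 100), and cumulative frequencies
freq  <- c(19, 13, 8, 5, 5)   # sorted class frequencies, n = 50
rel   <- freq / sum(freq)     # relative frequency
pct   <- 100 * rel            # percent frequency
cfreq <- cumsum(freq)         # cumulative frequency: 19 32 40 45 50
# #Relative frequencies sum to 1; the last cumulative frequency equals n
stopifnot(isTRUE(all.equal(sum(rel), 1)), tail(cfreq, 1) == sum(freq))
```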
ggplot() does not allow easy setup of a dual axis.
| softdrink | Frequency | cSUM | PROP | PCT | cPCT |
|---|---|---|---|---|---|
| Coca-Cola | 19 | 19 | 38 | 38% | 38% |
| Pepsi | 13 | 32 | 26 | 26% | 64% |
| Diet Coke | 8 | 40 | 16 | 16% | 80% |
| Dr. Pepper | 5 | 45 | 10 | 10% | 90% |
| Sprite | 5 | 50 | 10 | 10% | 100% |
Figure 10.1 Bar Chart and Pie Chart of Frequency
# #Frequency Distribution
aa <- tibble(softdrink = c("Coca-Cola", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite"),
Frequency = c(19, 8, 5, 13, 5))
#
# #Sort, Cumulative Sum, Percentage, and Cumulative Percentage
bb <- aa %>%
arrange(desc(Frequency)) %>%
mutate(cSUM = cumsum(Frequency)) %>%
mutate(PROP = 100 * Frequency / sum(Frequency)) %>%
mutate(PCT = paste0(PROP, "%")) %>%
mutate(cPCT = paste0(100 * cumsum(Frequency) / sum(Frequency), "%"))
# #Sorted Bar Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P01 <- bb %>% mutate(across(softdrink, factor, levels = rev(unique(softdrink)))) %>% {
ggplot(data = ., aes(x = softdrink, y = Frequency)) +
geom_bar(stat = 'identity', aes(fill = softdrink)) +
scale_y_continuous(sec.axis = sec_axis(~ (. / sum(bb$Frequency))*100, name = "Percentages",
labels = function(b) { paste0(round(b, 0), "%")})) +
geom_text(aes(label = paste0(Frequency, "\n(", PCT, ")")), vjust = 2,
colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
k_gglayer_bar +
labs(x = "Soft Drinks", y = "Frequency", subtitle = NULL,
caption = "C02P01", title = "Bar Chart of Categorical Data")
}
# #Pie Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P02 <- bb %>% mutate(across(softdrink, factor, levels = unique(softdrink))) %>% {
ggplot(data = ., aes(x = '', y = Frequency, fill = rev(softdrink))) +
geom_bar(stat = 'identity', width = 1, color = "white") +
coord_polar(theta = "y", start = 0) +
geom_text(aes(label = paste0(softdrink, "\n", Frequency, " (", PCT, ")")),
position = position_stack(vjust = 0.5),
colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
k_gglayer_pie +
labs(caption = "C02P02", title = "Pie Chart of Categorical Data")
}
f_theme_gg <- function(base_size = 14) {
# #Create a Default Theme
theme_bw(base_size = base_size) %+replace%
theme(
# #The whole figure
plot.title = element_text(size = rel(1), face = "bold",
margin = margin(0,0,5,0), hjust = 0),
# #Area where the graph is located
panel.grid.minor = element_blank(),
panel.border = element_blank(),
# #The axes
axis.title = element_text(size = rel(0.85), face = "bold"),
axis.text = element_text(size = rel(0.70), face = "bold"),
# arrow = arrow(length = unit(0.3, "lines"), type = "closed"),
axis.line = element_line(color = "black"),
# The legend
legend.title = element_text(size = rel(0.85), face = "bold"),
legend.text = element_text(size = rel(0.70), face = "bold"),
legend.key = element_rect(fill = "transparent", colour = NA),
legend.key.size = unit(1.5, "lines"),
legend.background = element_rect(fill = "transparent", colour = NA),
# Labels in the case of facetting
strip.background = element_rect(fill = "#17252D", color = "#17252D"),
strip.text = element_text(size = rel(0.85), face = "bold", color = "white", margin = margin(5,0,5,0))
)
}
# #Change default ggplot2 theme
theme_set(f_theme_gg())
#
# #List of Specific sets. Note '+' is replaced by ','
k_gglayer_bar <- list(
scale_fill_viridis_d(),
theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=1),
legend.position = 'none', axis.title.y.right = element_blank())
)
#
# #Pie
k_gglayer_pie <- list(
scale_fill_viridis_d(),
#theme_void(),
theme(#panel.background = element_rect(fill = "white", colour = "white"),
#plot.background = element_rect(fill = "white",colour = "white"),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
#panel.border = element_rect(colour = "black", fill=NA, size=1),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = 'none')
)
#
# #Histogram
k_gglayer_hist <- list(
scale_fill_viridis_c(direction = -1, alpha = 0.9),
theme(panel.grid.major.x = element_blank(), axis.line.y = element_blank(),
panel.border = element_blank(), axis.ticks.y = element_blank(),
legend.position = 'none')
)
#
# #Scatter Plot Trendline
k_gglayer_scatter <- list(
scale_fill_viridis_d(alpha = 0.9),
theme(panel.grid.minor = element_blank(),
panel.border = element_blank())
)
#
# #BoxPlot
k_gglayer_box <- list(
scale_fill_viridis_d(alpha = 0.9),
theme(panel.grid.major = element_line(colour = "#d3d3d3"),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(), panel.grid.major.x = element_blank(),
#plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
#text=element_text(family = "Tahoma"),
#axis.title = element_text(face="bold"),
#axis.text.x = element_text(colour="black", size = 11),
#axis.text.y = element_text(colour="black", size = 9),
axis.line = element_line(size=0.5, colour = "black"))
)
Solution: Use geom_bar(stat = "identity")
A histogram is used for continuous data, where the bins represent ranges of data, while a bar chart is a plot of categorical variables.
The three steps necessary to define the classes for a frequency distribution with quantitative data are: (1) determine the number of nonoverlapping classes, (2) determine the width of each class, and (3) determine the class limits.
set.seed(3)
# #Get Normal Data
bb <- tibble(aa = rnorm(n = 10000))
# #Histogram
# # '..count..' or '..x..'
C02P03 <- bb %>% {
ggplot(data = ., aes(x = aa, fill = ..count..)) +
geom_histogram(bins = 50, position = "identity") +
k_gglayer_hist +
labs(x = "Normal Data", y = "Count", subtitle = paste0("n = ", format(nrow(.), big.mark = ",")),
caption = "C02P03", title = "Histogram")
}
Figure 10.2 geom_histogram(): Histogram
# #Random Data
aa <- c(26, 35, 22, 47, 37, 5, 50, 49, 42, 2, 8, 7, 4, 47, 44, 35, 17, 49, 1, 48,
1, 27, 13, 26, 18, 44, 31, 4, 23, 47, 38, 28, 28, 5, 35, 39, 29, 13, 17,
38, 1, 8, 3, 30, 18, 37, 29, 39, 7, 28)
bb <- tibble(aa)
# #Dot Chart of Frequencies
C02P04 <- bb %>% {
ggplot(., aes(x = aa)) +
geom_dotplot(binwidth = 5, method = "histodot") +
theme(axis.line.y = element_blank(), panel.grid = element_blank(), axis.text.y = element_blank(),
axis.ticks.y = element_blank(), axis.title.y = element_blank()) +
labs(x = "Bins", subtitle = "Caution: Avoid! Y-Axis is deceptive.",
caption = "C02P04", title = "Dot Plot")
}
Figure 10.3 geom_dotplot(): Frequency Dot Chart
# #Judges: Because we are evaluating 'Judges', they are the 'elements' and thus are the 'rows'
xxJudges <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
# #Unaggregated crosstab for both Judges in different types of Courts
xxKen <- tibble(Ken = c("Common", "Municipal "),
Upheld = c(90, 20), Reversed = c(10, 5))
xxAbel <- tibble(Abel = c("Common", "Municipal "),
Upheld = c(29, 100), Reversed = c(3, 18))
# #Judges
aa <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
bb <- tibble(Verdict_Judge = c('Upheld', 'Reversed'), Abel = c(129, 21), Ken = c(110, 15))
aa
## # A tibble: 2 x 3
## Judge_Verdict Upheld Reversed
## <chr> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
# #Transpose, Assuming First Column Header has "Row_Col" Type Format
ii <- aa %>%
`attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>%
`attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>%
pivot_longer(c(-1),
names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost),
values_to = "Values") %>%
pivot_wider(names_from = 1, values_from = Values) %>%
`attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL)
stopifnot(identical(bb, ii))
ii
## # A tibble: 2 x 3
## Verdict_Judge Abel Ken
## <chr> <dbl> <dbl>
## 1 Upheld 129 110
## 2 Reversed 21 15
# #Testing for Reverse
ii <- bb %>%
`attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>%
`attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>%
pivot_longer(c(-1),
names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost),
values_to = "Values") %>%
pivot_wider(names_from = 1, values_from = Values) %>%
`attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL)
stopifnot(identical(aa, ii))
bb <- "Judge_Verdict"
# #Split String by strsplit(), output is list
(ii <- unlist(strsplit(bb, "_")))
## [1] "Judge" "Verdict"
#
# #Split on Dot
bb <- "Judge.Verdict"
# #Using character classes
ii <- unlist(strsplit(bb, "[.]"))
# #By escaping special characters
jj <- unlist(strsplit(bb, "\\."))
# #Using Options
kk <- unlist(strsplit(bb, ".", fixed = TRUE))
stopifnot(all(identical(ii, jj), identical(ii, kk)))
jj <- ii <- bb <- aa
# #attr() adds or removes an attribute
attr(bb, "NewOne") <- "abc"
# #Using Backticks
ii <- `attr<-`(ii, "NewOne", "abc")
# #Using Pipe
jj <- jj %>% `attr<-`("NewOne", "abc")
#
stopifnot(all(identical(bb, ii), identical(bb, jj)))
#
# #List Attributes
names(attributes(bb))
## [1] "class" "row.names" "names" "NewOne"
#
# #Specific Attribute Value
attributes(bb)$NewOne
## [1] "abc"
#
# #Remove Attributes
attr(bb, "NewOne") <- NULL
ii <- `attr<-`(ii, "NewOne", NULL)
jj <- jj %>% `attr<-`("NewOne", NULL)
stopifnot(all(identical(bb, ii), identical(bb, jj)))
# #(Deprecated) Issues:
# #(1) bind_rows() needs two dataframes. Thus, first can be skipped in Pipe, But...
# #The second dataframe cannot be replaced with dot (.), it has to have a name
# #(2) Pipe usage inside function call was working but was a concern
# #(3) It introduced NA, for which replace() was needed as another step
ii <- aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum))) %>%
mutate(across(1, ~ replace(., . %in% NA, "Total")))
#
# #(Deprecated) Works but needs ALL Column Names individually
jj <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.[, 2]), Reversed = sum(.[, 3]))
kk <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.$Upheld), Reversed = sum(.$Reversed))
#
# #(Deprecated) Removed the Multiple call to sum(). However, it needs First Column Header Name
ll <- aa %>% add_row(Judge_Verdict = "Total", summarise(., across(where(is.numeric), sum)))
# #(Deprecated) Replaced Column Header Name using "Tilde"
mm <- aa %>% add_row(summarise(., across(where(is.character), ~"Total")),
summarise(., across(where(is.numeric), sum)))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
#
# #(Working): Minimised
aa %>% add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum)))
## # A tibble: 3 x 3
## Judge_Verdict Upheld Reversed
## <chr> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
## 3 Total 239 36
# #USE '%in%' for NA, otherwise '==' works
bb <- aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum)))
#
ii <- bb %>% mutate(across(1, ~ replace(., . %in% NA, "Total")))
mm <- bb %>% mutate(across(1, ~ replace(., is.na(.), "Total")))
jj <- bb %>% mutate(Judge_Verdict = replace(Judge_Verdict, is.na(Judge_Verdict), "Total"))
kk <- bb %>% mutate(across(1, coalesce, "Total"))
ll <- bb %>% mutate(across(1, ~ replace_na(.x, "Total")))
nn <- bb %>% mutate(across(1, replace_na, "Total"))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll),
identical(ii, mm), identical(ii, nn)))
#
# #Replace NA in a Factor
bb %>%
mutate(Judge_Verdict = factor(Judge_Verdict)) %>%
mutate(across(1, fct_explicit_na, na_level = "Total"))
## # A tibble: 3 x 3
## Judge_Verdict Upheld Reversed
## <fct> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
## 3 Total 239 36
# #Convert to Factor
aa %>% mutate(Judge_Verdict = factor(Judge_Verdict))
## # A tibble: 2 x 3
## Judge_Verdict Upheld Reversed
## <fct> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
# #Paste but do not execute
aa <- read_delim(clipboard())
# #Copy Excel Data, then execute the above command
#
# #Print its structure
dput(aa)
# #Copy the relevant values, headers in tibble()
bb <- tibble( )
# #The above command will be the setup to generate this tibble anywhere
ex27 <- tibble(Observation = 1:30,
x = c("A", "B", "B", "C", "B", "C", "B", "C", "A", "B", "A", "B", "C", "C", "C",
"B", "C", "B", "C", "B", "C", "B", "C", "A", "B", "C", "C", "A", "B", "B"),
y = c(1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2,
2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2))
bb <- ex27
str(bb)
## tibble [30 x 3] (S3: tbl_df/tbl/data.frame)
## $ Observation: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
## $ x : chr [1:30] "A" "B" "B" "C" ...
## $ y : num [1:30] 1 1 1 2 1 2 1 2 1 1 ...
# #Create CrossTab
bb <- bb %>%
count(x, y) %>%
pivot_wider(names_from = y, values_from = n, values_fill = 0)
bb
## # A tibble: 3 x 3
## x `1` `2`
## <chr> <int> <int>
## 1 A 5 0
## 2 B 11 2
## 3 C 2 10
# #Rowwise Percentage in Separate New Columns
bb %>%
mutate(SUM = rowSums(across(where(is.numeric)))) %>%
mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1), .names = "{.col}_Row" ))
## # A tibble: 3 x 7
## x `1` `2` SUM `1_Row` `2_Row` SUM_Row
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 A 5 0 5 100 0 100
## 2 B 11 2 13 84.6 15.4 100
## 3 C 2 10 12 16.7 83.3 100
#
# #Rowwise Percentage in Same Columns
bb %>%
mutate(SUM = rowSums(across(where(is.numeric)))) %>%
mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1)))
## # A tibble: 3 x 4
## x `1` `2` SUM
## <chr> <dbl> <dbl> <dbl>
## 1 A 100 0 100
## 2 B 84.6 15.4 100
## 3 C 16.7 83.3 100
#
# #Equivalent
bb %>%
mutate(SUM = rowSums(across(where(is.numeric))),
across(where(is.numeric), ~ round(. * 100 /SUM, 1)))
## # A tibble: 3 x 4
## x `1` `2` SUM
## <chr> <dbl> <dbl> <dbl>
## 1 A 100 0 100
## 2 B 84.6 15.4 100
## 3 C 16.7 83.3 100
#
# #Columnwise Percentage in Separate New Columns
bb %>%
mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1), .names = "{.col}_Col" ))
## # A tibble: 3 x 5
## x `1` `2` `1_Col` `2_Col`
## <chr> <int> <int> <dbl> <dbl>
## 1 A 5 0 27.8 0
## 2 B 11 2 61.1 16.7
## 3 C 2 10 11.1 83.3
# #Columnwise Percentage in Same Columns
bb %>%
mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1)))
## # A tibble: 3 x 3
## x `1` `2`
## <chr> <dbl> <dbl>
## 1 A 27.8 0
## 2 B 61.1 16.7
## 3 C 11.1 83.3
ex28 <- tibble(Observation = 1:20,
x = c(28, 17, 52, 79, 37, 71, 37, 27, 64, 53, 13, 84, 59, 17, 70, 47, 35, 62, 30, 43),
y = c(72, 99, 58, 34, 60, 22, 77, 85, 45, 47, 98, 21, 32, 81, 34, 64, 68, 67, 39, 28))
bb <- ex28
# #Rounding to the lowest 10s before min and to the highest 10s after max
nn <- 10L
n_x <- seq(floor(min(bb$x) / nn) * nn, ceiling(max(bb$x) / nn) * nn, by = 20)
n_y <- seq(floor(min(bb$y) / nn) * nn, ceiling(max(bb$y) / nn) * nn, by = 20)
#
# #Labels in the format of [10-29]
lab_x <- paste0(n_x, "-", n_x + 20 - 1) %>% head(-1)
lab_y <- paste0(n_y, "-", n_y + 20 - 1) %>% head(-1)
# #Wider Table without Totals
ii <- bb %>%
mutate(x_bins = cut(x, breaks = n_x, right = FALSE, labels = lab_x),
y_bins = cut(y, breaks = n_y, right = FALSE, labels = lab_y)) %>%
count(x_bins, y_bins) %>%
pivot_wider(names_from = y_bins, values_from = n, values_fill = 0, names_sort = TRUE)
print(ii)
## # A tibble: 4 x 5
## x_bins `20-39` `40-59` `60-79` `80-99`
## <fct> <int> <int> <int> <int>
## 1 10-29 0 0 1 4
## 2 30-49 2 0 4 0
## 3 50-69 1 3 1 0
## 4 70-89 4 0 0 0
# #Cross Tab with Total Column and Total Row
jj <- ii %>%
bind_rows(ii %>% summarise(across(where(is.numeric), sum))) %>%
mutate(across(1, fct_explicit_na, na_level = "Total")) %>%
mutate(SUM = rowSums(across(where(is.numeric))))
print(jj)
## # A tibble: 5 x 6
## x_bins `20-39` `40-59` `60-79` `80-99` SUM
## <fct> <int> <int> <int> <int> <dbl>
## 1 10-29 0 0 1 4 5
## 2 30-49 2 0 4 0 6
## 3 50-69 1 3 1 0 5
## 4 70-89 4 0 0 0 4
## 5 Total 7 3 6 4 20
# #Group Continuous Data to Categorical Bins by base::cut()
bb <- ex28
#
# #NOTE cut() increases the range slightly but ggplot functions do not
bb %>% mutate(x_bins = cut(x, breaks = 8)) %>%
pull(x_bins) %>% levels()
## [1] "(12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
#
# #By default, it excludes the lower range, but it can be included by option
bb %>% mutate(x_bins = cut(x, breaks = 8, include.lowest = TRUE)) %>%
pull(x_bins) %>% levels()
## [1] "[12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
#
# #ggplot::cut_interval() makes n groups with equal range. There is a cut_number() also
bb %>% mutate(x_bins = cut_interval(x, n = 8)) %>%
pull(x_bins) %>% levels()
## [1] "[13,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84]"
#
# #Specific Bins
bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10))) %>%
pull(x_bins) %>% levels()
## [1] "(10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
ii <- bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10), include.lowest = TRUE)) %>%
pull(x_bins) %>% levels()
print(ii)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
#
# #ggplot::cut_width() makes groups of width
bb %>% mutate(x_bins = cut_width(x, width = 10)) %>%
pull(x_bins) %>% levels()
## [1] "[5,15]" "(15,25]" "(25,35]" "(35,45]" "(45,55]" "(55,65]" "(65,75]" "(75,85]"
#
# #Match cut_width() and cut()
jj <- bb %>% mutate(x_bins = cut_width(x, width = 10, boundary = 0)) %>%
pull(x_bins) %>% levels()
print(jj)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
stopifnot(identical(ii, jj))
#
# #Labelling
n_breaks <- seq(10, 90, by = 10)
n_labs <- paste0("*", n_breaks, "-", n_breaks + 10) %>% head(-1)
bb %>% mutate(x_bins = cut(x, breaks = n_breaks, include.lowest = TRUE, labels = n_labs)) %>%
pull(x_bins) %>% levels()
## [1] "*10-20" "*20-30" "*30-40" "*40-50" "*50-60" "*60-70" "*70-80" "*80-90"
xxCommercials <- tibble(Week = 1:10,
Commercials = c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2),
Sales = c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46))
f_setRDS(xxCommercials)
Figure 10.4 geom_point(), geom_smooth(), & stat_poly_eq()
bb <- xxCommercials
# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot and Trendline along with its equation and R2 value
C02P05 <- bb %>% {
ggplot(data = ., aes(x = Commercials, y = Sales)) +
geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")),
formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
eq.x.rhs = "~italic(x)", parse = TRUE) +
geom_point() +
labs(x = "Commercials", y = "Sales ($100s)",
subtitle = paste0("Trendline equation and R", '\u00b2', " value"),
caption = "C02P05", title = "Scatter Plot")
}
9.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.
Individual numbers can be represented by symbols, called numerals; for example, “5” is a numeral that represents the ‘number five.’
As only a relatively small number of symbols can be memorized, basic numerals are commonly organized in a numeral system, which is an organized way to represent any number. The most common numeral system is the Hindu–Arabic numeral system, which allows for the representation of any number using a combination of ten fundamental numeric symbols, called digits.
Counting is the process of determining the number of elements of a finite set of objects, i.e., determining the size of a set. Enumeration refers to uniquely identifying the elements of a set by assigning a number to each element.
Measurement is the quantification of attributes of an object or event, which can be used to compare with other objects or events.
Formally, \(\mathbb{N} \to \mathbb{Z} \to \mathbb{Q} \to \mathbb{R} \to \mathbb{C}\)
Practically, \(\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}\)
The natural numbers \(\mathbb{N}\) are those numbers used for counting and ordering. The ISO standard begins the natural numbers with 0, corresponding to the non-negative integers \(\mathbb{N} = \{0, 1, 2, 3, \ldots \}\), whereas other conventions start with 1, corresponding to the positive integers \(\mathbb{N^*} = \{1, 2, 3, \ldots \}\)
The set of integers \(\mathbb{Z}\) consists of zero (\({0}\)), the positive natural numbers \(\{1, 2, 3, \ldots \}\), and their additive inverses (the negative integers). Thus, \(\mathbb{Z} = \{\ldots, -3, -2, -1, 0, 1, 2, 3, \ldots \}\). An integer is colloquially defined as a number that can be written without a fractional component.
Rational numbers \(\mathbb{Q}\) are those which can be expressed as the quotient or fraction p/q of two integers, a numerator p and a non-zero denominator q. Thus, Rational Numbers \(\mathbb{Q} = \{p/q : p \in \mathbb{Z}, q \in \mathbb{Z}, q \neq 0 \}\)
A real number is a value of a continuous quantity that can represent a distance along a line. The real numbers include all the rational numbers \(\mathbb{Q}\), and all the irrational numbers. Thus, Real Numbers \(\mathbb{R} = \mathbb{Q} \cup \{\sqrt{2}, \sqrt{3}, \ldots\} \cup \{ \pi, e, \phi, \ldots \}\)
The complex numbers \(\mathbb{C}\) contain numbers which are expressed in the form \(a + ib\), where \({a}\) and \({b}\) are real numbers. These have two components: the real part \({a}\), and the imaginary part \({b}\) multiplied by a specific element denoted by \({i}\) (the imaginary unit), which satisfies the equation \(i^2 = -1\).
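As a quick sketch of how these components appear in R: the imaginary unit is written `1i`, and `complex()` builds a number from real and imaginary parts.

```r
# #The imaginary unit satisfies i^2 = -1
stopifnot((1i)^2 == -1 + 0i)
# #Build a + ib with a = 3, b = 4 and inspect its components
z <- complex(real = 3, imaginary = 4)
Re(z); Im(z)                 # real and imaginary parts: 3 and 4
stopifnot(Mod(z) == 5)       # modulus = sqrt(3^2 + 4^2) = 5
# #sqrt() of a negative double is NaN; cast to complex first
sqrt(as.complex(-1))         # 0+1i
```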
The number Pi \(\pi = 3.14159\ldots\) is defined as the ratio of the circumference of a circle to its diameter.
\[\pi = \int _{-1}^{1} \frac{dx}{\sqrt{1- x^2}} \tag{11.1}\]
\[e^{i\varphi}=\cos \varphi + i\sin \varphi \tag{11.2}\]
\[e^{i\pi} + 1 = 0 \tag{11.3}\]
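Both identities can be checked numerically in R as a sketch: `integrate()` evaluates Equation (11.1) despite the integrable endpoint singularities, and complex `exp()` reproduces Equation (11.3) to machine precision.

```r
# #Equation (11.1): integral of 1/sqrt(1 - x^2) over [-1, 1] equals pi
ii <- integrate(function(x) 1 / sqrt(1 - x^2), lower = -1, upper = 1)
stopifnot(abs(ii$value - pi) < 1e-3)
# #Equation (11.3): Euler's identity, up to floating-point rounding
stopifnot(Mod(exp(1i * pi) + 1) < 1e-15)
```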
# #Read OEIS b-file (A000796) for 20000 PI digits including integral (3) and fractional (14159...)
# #md5sum = "daf0b33a67fd842a905bb577957a9c7f"
tbl <- read_delim(file = paste0(.z$XL, "PI-OIS-b000796.txt"),
delim = " ", col_names = c("POS", "VAL"), col_types = list(POS = "i", VAL = "i"))
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxPI <- tbl
f_setRDS(xxPI)
Euler's number \(e = 2.71828\ldots\) is the base of the natural logarithm.
\[e = \lim_{n \to \infty} \left(1 + \frac{1}{n} \right)^{n} = \sum \limits_{n=0}^{\infty} \frac{1}{n!} \tag{11.4}\]
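A quick numeric check of Equation (11.4): the limit form converges slowly, while the factorial series converges after only a few terms.

```r
# #Limit definition: (1 + 1/n)^n approaches e as n grows
n <- 1e6
stopifnot(abs((1 + 1/n)^n - exp(1)) < 1e-5)
# #Series definition: sum of 1/n! for n = 0, 1, 2, ... (20 terms suffice)
e_series <- sum(1 / factorial(0:20))
stopifnot(abs(e_series - exp(1)) < 1e-12)
```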
Two quantities are in the golden ratio \(\varphi = 1.618\ldots\) if their ratio is the same as the ratio of their sum to the larger of the two quantities.
\[\varphi^2 - \varphi - 1 = 0 \\ \varphi = \frac{1 + \sqrt{5}}{2} \tag{11.5}\]
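A quick numeric check of Equation (11.5):

```r
# #The golden ratio is the positive root of phi^2 - phi - 1 = 0
phi <- (1 + sqrt(5)) / 2
stopifnot(abs(phi^2 - phi - 1) < 1e-14)
# #Defining property: the ratio of the sum to the larger equals phi
stopifnot(isTRUE(all.equal((phi + 1) / phi, phi)))
```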
Mersenne primes:
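A Mersenne prime is a prime of the form \(2^p - 1\) where the exponent \(p\) is itself prime; the converse does not hold. A minimal sketch (note that 131071 and 524287, used as limits elsewhere in this chapter, are Mersenne primes):

```r
# #Candidate Mersenne numbers 2^p - 1 for the first few prime exponents p
p <- c(2, 3, 5, 7, 11, 13, 17, 19)
mers <- 2^p - 1
print(mers)   # 3 7 31 127 2047 8191 131071 524287
# #Not every candidate is prime: 2^11 - 1 = 2047 = 23 * 89
stopifnot(2047 == 23 * 89)
# #2^17 - 1 = 131071 and 2^19 - 1 = 524287 are Mersenne primes
stopifnot(mers[7] == 131071, mers[8] == 524287)
```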
# #Create empty Vector with NA
aa <- rep(NA_integer_, 10)
print(aa)
## [1] NA NA NA NA NA NA NA NA NA NA
f_isPrime <- function(x) {
# #Check if the number is Prime
if(!is.integer(x)) {
cat("Error! Integer required. \n")
stop()
} else if(x <= 0L) {
cat("Error! Positive Integer required. \n")
stop()
} else if(x > 2147483647L) {
cat(paste0("Doubles are stored as approximation. Prime will not be calculated for value higher than '2147483647' \n"))
stop()
}
# #However, this checks the number against ALL Smaller values including non-primes
if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
# # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
return(TRUE)
} else {
## (any(x %% 2L:ceiling(sqrt(x)) == 0))
## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
return(FALSE)
}
}
# #Vectorise Version
f_isPrimeV <- Vectorize(f_isPrime)
# #Compiled Version
f_isPrimeC <- cmpfun(f_isPrime)
# #There are 4 Primes in First 10, 25 in 100, 168 in 1000, 1229 in 10000.
# # Using Vectorise Version, get all the Primes
aa <- 1:10
bb <- aa[f_isPrimeV(aa)]
ii <- f_getPrimeUpto(10)
stopifnot(identical(bb, ii))
# #
xxPrime10 <- c(2, 3, 5, 7) |> as.integer()
# #
xxPrime100 <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
53, 59, 61, 67, 71, 73, 79, 83, 89, 97) |> as.integer()
#
# #Generate List of ALL Primes till 524287 (i.e. Total 43,390 Primes)
xxPrimes <- f_getPrimeUpto(524287L)
# #Save as RDS
f_setRDS(xxPrimes)
# #NOTE: Assigning 2147483647L causes the Chunk to throw Warnings even with 'eval=FALSE'.
if(FALSE){
# #Assignment of 2305843009213693951L is NOT possible without Warning
# #Even within non-executing Block or with 'eval=FALSE' or suppressWarnings() or tryCatch()
# #It cannot be stored as integer, thus it is automatically converted to double
#bb <- 2305843009213693951L
# #Warning: non-integer value 2305843009213693951L qualified with L; using numeric value
# #NOTE that the value changed. It is explicitly NOT a prime anymore.
#print(aa, digits = 20)
# #[1] 2305843009213693952
#
# #Assignment of 2147483647L is possible and direct printing in console works BUT
# #Its printing will also throw Warnings that are difficult to handle
# #Avoid Printing. Even within non-executing Block, it is affecting R Bookdown.
aa <- 2147483647L
#print(aa)
}
f_getPrimeUpto <- function(x){
# #Get a Vector of Primes upto the given Number (Max. 524287)
if(x < 2) {
print("NOT ALLOWED!")
return(NULL)
} else if(x > 524287){
print("Sadly, beyond this number it is difficult to generate the List of Primes!")
return(NULL)
}
y <- 2:x
i <- 1
while (y[i] <= sqrt(x)) {
y <- y[y %% y[i] != 0 | y == y[i]]
i <- i+1
}
return(y)
}
# #Compare any number of functions
result <- microbenchmark(
sum(1:100)/length(1:100),
mean(1:100),
#times = 1000,
check = 'identical'
)
# #Print Table
print(result)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## sum(1:100)/length(1:100) 1.2 1.301 1.54795 1.5005 1.6005 7.501 100 a
## mean(1:100) 5.9 6.001 6.56989 6.1010 6.2010 28.001 100 b
#
# #Boxplot of Benchmarking Result
#autoplot(result)
# #Above testcase showed a surprising result of sum()/length() being much faster than mean()
#
# #Or Compare Plot Rendering
if(FALSE) microbenchmark(print(jj), print(kk), print(ll), times = 2)
“ForLater” - Include rowsum(), rowSums(), colSums(), rowMeans(), colMeans() in this also.
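As a starting point for the “ForLater” note above (an added sketch, not part of the original comparison): `rowSums()` is implemented in C and typically beats an `apply()` loop over rows, while returning an equal result.

```r
set.seed(3)
mm <- matrix(rnorm(1e4), nrow = 100)
# #Same result, different machinery: rowSums() (C loop) vs apply(..., 1, sum) (R loop)
stopifnot(isTRUE(all.equal(rowSums(mm), apply(mm, 1, sum))))
# #Benchmark the two, verifying equal output, if microbenchmark is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(rowSums(mm), apply(mm, 1, sum), check = 'equal'))
}
```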
# #Conclusion: use mean(), because accuracy is harder to achieve than speed
#
# #sum()/length() is faster than mean()
# #However, mean() does double pass, so it would be more accurate
# #mean.default() and var() compute means with an additional pass and so are more accurate
# #e.g. the variance of a constant vector is (almost) always zero
# #and the mean of such a vector will be equal to the constant value to machine precision.
aa <- 1:100
#
microbenchmark(
sum(aa)/length(aa),
mean(aa),
mean.default(aa),
.Internal(mean(aa)),
#times = 1000,
check = 'identical'
)
## Unit: nanoseconds
## expr min lq mean median uq max neval cld
## sum(aa)/length(aa) 400 400 523 500 500 3400 100 a
## mean(aa) 2500 2700 3001 2700 2800 20500 100 c
## mean.default(aa) 1300 1400 1551 1500 1500 6600 100 b
## .Internal(mean(aa)) 500 500 593 600 600 4300 100 a
# #rnorm() generates random deviates of given length
set.seed(3)
aa <- rnorm(1e7)
str(aa)
## num [1:10000000] -0.962 -0.293 0.259 -1.152 0.196 ...
#
# #NOTE manual calculation and mean() is NOT matching
identical(sum(aa)/length(aa), mean(aa))
## [1] FALSE
#
# #There is a slight difference
sum(aa)/length(aa) - mean(aa)
## [1] 2.355429e-17
if(FALSE) {
# #Remove all objects matching a pattern
rm(list = ls(pattern = "f_"))
}
# #Check the Current Options Value
getOption("expressions")
## [1] 5000
if(FALSE) {
# #Change Value
# #NOTE it did not help when recursive function failed
# #Error: node stack overflow
# #Error during wrapup: node stack overflow
# #Error: no more error handlers available ...
options(expressions=10000)
}
# #To Vectorise a Function
f_isPrimeV <- Vectorize(f_isPrime)
# #To Pre-Compile a Function for faster performance
f_isPrimeC <- cmpfun(f_isPrime)
# #To Profile Function Calls for improvements
Rprof("file.out")
f_isPrime(2147483647L)
#f_getPrimesUpto(131071L)
Rprof(NULL)
summaryRprof("file.out")
# #Functions to check for PRIME - All of them have various problems
# #"-3L -2L -1L 0L 1L 8L" FALSE "2L 3L ... 524287L 2147483647L" TRUE
isPrime_a <- function(x) {
# #Fails for "2147483647L" Error: cannot allocate vector of size 8.0 Gb
if (x == 2L) {
return(TRUE)
} else if (any(x %% 2:(x-1) == 0)) {
return(FALSE)
} else return(TRUE)
}
isPrime_b <- function(x){
# #Comparison of Division and Integer Division by 1, 2, ..., x
# #Fails for "2147483647L" Error: cannot allocate vector of size 16.0 Gb
# #Fails for "-ve and zero" Error: missing value where TRUE/FALSE needed
# vapply(x, function(y) sum(y / 1:y == y %/% 1:y), integer(1L)) == 2L
if(sum(x / 1:x == x %/% 1:x) == 2) {
return(TRUE)
} else return(FALSE)
}
isPrime_c <- function(x) {
# #RegEx is slowest: it converts -ve values and coerces non-integers, which may result in bugs
x <- abs(as.integer(x))
if(x > 8191L) {
print("Do not run this with large values. RegEx is really slow.")
stop()
}
!grepl('^1?$|^(11+?)\\1+$', strrep('1', x))
}
isPrime_d <- function(x) {
# #Fails for "1" & returns TRUE
# #Fails for "-ve and zero" Error: NA/NaN argument
if(x == 2L || all(x %% 2L:max(2, floor(sqrt(x))) != 0)) {
return(TRUE)
} else return(FALSE)
}
isPrime_e <- function(x) {
# #Fails for "-ve and zero" Error: NA/NaN argument
# #This is the most robust which can be improved by conditional check for positive integers
# #However, this checks the number against ALL Smaller values including non-primes
if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
# # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
return(TRUE)
} else {
## (any(x %% 2L:ceiling(sqrt(x)) == 0))
## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
return(FALSE)
}
}
# #131071 (12,251st), 524287 (43,390th), 2147483647 (105,097,565th)
aa <- 1:131071
# #Following works but only till 524287L, Memory Overflow ERROR for 2147483647L
bb <- aa[f_isPrimeV(aa)]
getPrimeUpto_a <- function(x){
# #Extremely slow, cannot go beyond 8191L in benchmark testing
if(x < 2) return("ERROR")
y <- 2:x
primes <- rep(2L, x)
j <- 1L
for (i in y) {
if (!any(i %% primes == 0)) {
j <- j + 1L
primes[j] <- i
#cat(paste0("i=", i, ", j=", j, ", Primes= ", paste0(head(primes, j), collapse = ", ")))
}
#cat("\n")
}
result <- head(primes, j)
#str(result)
#cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
#cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
return(result)
}
getPrimeUpto_b <- function(x){
# #https://stackoverflow.com/questions/3789968/
# #This is even faster than "aa[f_isPrimeV(aa)]"
if(x < 2) return("ERROR")
y <- 2:x
i <- 1
while (y[i] <= sqrt(x)) {
y <- y[y %% y[i] != 0 | y == y[i]]
i <- i+1
}
result <- y
#str(result)
#cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
#cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
return(result)
}
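For generating all primes up to x, the classic Sieve of Eratosthenes (not among the source variants; a hedged sketch) typically beats the filtering loop above, at the cost of an x-length logical vector:

```r
# #Sieve of Eratosthenes: mark composites in a logical vector, keep survivors
getPrimeUpto_sieve <- function(x) {
  if (x < 2) return(integer(0))
  is_prime <- rep(TRUE, x)
  is_prime[1] <- FALSE                      # #1 is not prime
  if (x >= 4) {
    for (i in 2:floor(sqrt(x))) {
      if (is_prime[i]) {
        # #Multiples of i below i^2 were already marked by smaller primes
        is_prime[seq.int(i * i, x, by = i)] <- FALSE
      }
    }
  }
  which(is_prime)
}
```

Like the functions above, it still cannot reach 2147483647L, since rep(TRUE, x) would itself need a vector of that length.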
getPrimeUpto_c <- function(x) {
# #Problems and Slow
# #Returns a Vector of Primes up to the Number, i.e. f_getPrimesUpto(7) = (2, 3, 5, 7)
# #NOTE: f_getPrimesUpto(1) and f_getPrimesUpto(2) both return "2"
if(!is.integer(x)) {
cat("Error! Integer required. \n")
stop()
} else if(!identical(1L, length(x))) {
cat("Error! Unit length vector required. \n")
stop()
} else if(x <= 0L) {
cat("Error! Positive Integer required. \n")
stop()
} else if(x > 2147483647) {
cat(paste0("Doubles are stored as approximation. Prime will not be calculated for value higher than '2147483647' \n"))
stop()
}
# #Cannot create vector of length 2147483647L and also not needed that many
# #ceiling(sqrt(7L)) return 3, however we need length 4 (2, 3, 5, 7)
# #So, added extra 10
#primes <- rep(NA_integer_, 10L + sqrt(2L))
primes <- rep(2L, 10L + sqrt(x))
j <- 1L
primes[j] <- 2L
#
i <- 2L
while(i <= x) {
# #na.omit() was the slowest step, so changed all NA to 2L in the primes
#k <- na.omit(primes[primes <= ceiling(sqrt(i))])
k <- primes[primes <= ceiling(sqrt(i))]
if(all(as.logical(i %% k))) {
j <- j + 1
primes[j] <- i
}
# #Increment with INTEGER Addition
i = i + 1L
}
result <- primes[complete.cases(primes)]
str(result)
cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
return(result)
}
getPrimeUpto_d <- function(n = 10L, i = 2L, primes = c(2L), bypass = TRUE){
# #Using Recursion is NOT a good solution
# #Function to return N Primes upto 1000 Primes (7919) or Max Value reaching 10000.
if(i > 10000){
cat("Reached 10000 \n")
return(primes)
}
if(bypass) {
maxN <- 1000L
if(!is.integer(n)) {
cat("Error! Integer required. \n")
stop()
} else if(!identical(1L, length(n))) {
cat("Error! Unit length vector required. \n")
stop()
} else if(n <= 0L) {
cat("Error! Positive Integer required. \n")
stop()
} else if(n > maxN) {
cat(paste0("Error! This will calculate only upto ", maxN, " Prime Numbers. \n"))
stop()
}
}
if(length(primes) < n) {
if(all(as.logical(i %% primes[primes <= ceiling(sqrt(i))]))) {
# #Coercing 0 to FALSE, Non-zero Values to TRUE
# # "i %% 2L:ceiling(sqrt(i))" checks i against all integers till sqrt(i)
# # "primes[primes <= ceiling(sqrt(i))]" checks i against only the primes till sqrt(i)
# #However, the above needs hardcoded 2L as prime so the vector is never empty
# #Current Number is Prime, so include it in the vector and check the successive one
getPrimeUpto_d(n, i = i + 1L, primes = c(primes, i), bypass = FALSE)
} else {
# #Current Number is NOT Prime, so check the successive one
getPrimeUpto_d(n, i = i + 1L, primes = primes, bypass = FALSE)
}
} else {
# #Return the vector when it reaches the count
return(primes)
}
}
\[\overline{x} = \frac{1}{n}\left (\sum_{i = 1}^n{{x}_i}\right ) = \frac{{x}_1 + {x}_2 + \cdots + {x}_n}{n} \tag{11.6}\]
In the mean calculation, each \({{x}_i}\) is normally given equal importance, i.e. a weight of \({1/n}\). In some instances, however, the mean is computed by giving each observation a weight that reflects its relative importance. A mean computed in this manner is referred to as the weighted mean, given in equation (11.7)
\[\overline{x} = \frac{\sum_{i=1}^n{w_ix_i}}{\sum_{i=1}^n{w_i}} \tag{11.7}\]
Caution: The unit of the mean is the same as the unit of the variable; e.g. for cost_per_kg the weight ‘w’ would be in ‘kg’.
aa <- 1:10
# #Mean of First 10 Numbers
mean(aa)
## [1] 5.5
aa <- 1:10
# #Mean of First 10 Numbers
ii <- mean(aa)
print(ii)
## [1] 5.5
jj <- sum(aa)/length(aa)
stopifnot(identical(ii, jj))
#
# #Mean of First 10 Prime Numbers (is neither Prime nor Integer)
mean(f_getRDS(xxPrimes)[1:10])
## [1] 12.9
#
# #Mean of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% mean()
## [1] 4.71
aa <- tibble(Purchase = 1:5, cost_per_kg = c(3, 3.4, 2.8, 2.9, 3.25),
kg = c(1200, 500, 2750, 1000, 800))
# #NOTE that unit of mean is same as unit of the variable e.g. cost_per_kg thus 'w' would be 'kg'
(ii <- sum(aa$cost_per_kg * aa$kg)/sum(aa$kg))
## [1] 2.96
jj <- with(aa, sum(cost_per_kg * kg)/sum(kg))
kk <- weighted.mean(x = aa$cost_per_kg, w = aa$kg)
stopifnot(all(identical(ii, jj), identical(ii, kk)))
\[\begin{align} \text{if n is odd, } median(x) & = x_{(n + 1)/2} \\ \text{if n is even, } median(x) & = \frac{x_{(n/2)} + x_{(n/2) + 1}}{2} \end{align} \tag{11.8}\]
aa <- 1:10
# #Median of First 10 Numbers
median(aa)
## [1] 5.5
aa <- 1:10
# #Median of First 10 Numbers
median(aa)
## [1] 5.5
#
# #Median of First 10 Prime Numbers (is NOT prime)
median(f_getRDS(xxPrimes)[1:10])
## [1] 12
#
# #Median of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% median()
## [1] 4.5
\[\overline{x}_g = \left(\prod _{i=1}^{n} x_i\right)^{\frac{1}{n}} = \sqrt[{n}]{x_1 x_2 \ldots x_n} \tag{11.9}\]
aa <- 1:10
# #Geometric Mean of First 10 Numbers
exp(mean(log(aa)))
## [1] 4.528729
aa <- 1:10
# #Geometric Mean of First 10 Numbers
ii <- exp(mean(log(aa)))
jj <- prod(aa)^(1/length(aa))
stopifnot(identical(ii, jj))
#
# #Geometric Mean of First 10 Prime Numbers
exp(mean(log(f_getRDS(xxPrimes)[1:10])))
## [1] 9.573889
# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ] %>% pull(VAL)
f_getMode(bb)
## [1] 9# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #Get Frequency
bb %>% count(VAL)
## # A tibble: 10 x 2
## VAL n
## <int> <int>
## 1 0 8
## 2 1 8
## 3 2 12
## 4 3 12
## 5 4 10
## 6 5 8
## 7 6 9
## 8 7 8
## 9 8 12
## 10 9 13
#
# #Get Mode
bb %>% pull(VAL) %>% f_getMode()
## [1] 9
f_getMode <- function(x) {
# #Calculate Statistical Mode
# #NOTE: Single Length, All NA, Characters etc. have NOT been validated
# #https://stackoverflow.com/questions/56552709
# #https://stackoverflow.com/questions/2547402
# #Remove NA
if (anyNA(x)) {
x <- x[!is.na(x)]
}
# #Get Unique Values
ux <- unique(x)
# #Match
ux[which.max(tabulate(match(x, ux)))]
}
To match Excel's "PERCENTILE.EXC", use the type = 6 option of quantile(); the default is type = 7.
\[L_p = \frac{p}{100}(n + 1) \tag{11.10}\]
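Equation (11.10) gives the position \(L_p\) of the p-th percentile. A hand-rolled sketch (illustrative only, valid when \(L_p\) falls strictly inside 1..n) that matches quantile(x, p, type = 6):

```r
# #p-th percentile via equation (11.10): position L_p, then linear interpolation
percentile_exc <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  Lp <- (p / 100) * (n + 1)   # #position, may be fractional
  lo <- floor(Lp)
  # #Interpolate between the lo-th and (lo + 1)-th ordered values
  x[lo] + (Lp - lo) * (x[lo + 1] - x[lo])
}
aa <- c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2)   # #hypothetical digits
stopifnot(percentile_exc(aa, 25) == quantile(aa, 0.25, names = FALSE, type = 6))
stopifnot(percentile_exc(aa, 50) == quantile(aa, 0.50, names = FALSE, type = 6))
```

For n = 10 and p = 25, \(L_p = 0.25 \times 11 = 2.75\), so the result lies three-quarters of the way from the 2nd to the 3rd ordered value.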
# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
quantile(bb$VAL, 0.5)
## 50%
## 4.5
# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
ii <- quantile(bb$VAL, 0.5)
print(ii)
## 50%
## 4.5
jj <- median(bb$VAL)
stopifnot(identical(unname(ii), jj))
#
# #All Quartiles
quantile(bb$VAL, seq(0, 1, 0.25))
## 0% 25% 50% 75% 100%
## 0.00 2.00 4.50 7.25 9.00
# #summary()
summary(bb$VAL)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.00 4.50 4.71 7.25 9.00
#
# #To Match with Excel "PERCENTILE.EXC" use type=6 in place of default type=7
quantile(bb$VAL, seq(0, 1, 0.25), type = 6)
## 0% 25% 50% 75% 100%
## 0.00 2.00 4.50 7.75 9.00
In addition to measures of location, it is often desirable to consider measures of variability, or dispersion.
range()
max() - min()
IQR()
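A quick sketch relating the three: range() returns the min-max pair, the statistical range is their difference, and IQR() is Q3 - Q1 from quantile() (default type = 7). Data values are hypothetical:

```r
aa <- c(38, 41, 46, 48, 50, 54, 54, 57, 59, 63)   # #hypothetical sample
# #range() gives c(min, max); diff() of that pair is the statistical range
stopifnot(identical(diff(range(aa)), max(aa) - min(aa)))
# #IQR() is the difference of the 75th and 25th percentiles
stopifnot(IQR(aa) == diff(quantile(aa, c(0.25, 0.75), names = FALSE)))
```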
\[\begin{align} \sigma^2 &= \frac{1}{n} \sum _{i=1}^{n} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align} \tag{11.11}\]
\[\begin{align} \sigma &= \sqrt{\frac{1}{n} \sum_{i=1}^n \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(x_i - \overline{x}\right)^2} \end{align} \tag{11.12}\]
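Equations (11.11) and (11.12) translate directly; note that var() and sd() in R use the sample (n - 1) denominator. A sketch with hypothetical data:

```r
bb <- c(46, 54, 42, 46, 32)   # #hypothetical sample
n <- length(bb)
# #Sample variance s^2: squared deviations over (n - 1), equation (11.11)
s2 <- sum((bb - mean(bb))^2) / (n - 1)
stopifnot(isTRUE(all.equal(s2, var(bb))))
# #Sample standard deviation s, equation (11.12)
stopifnot(isTRUE(all.equal(sqrt(s2), sd(bb))))
# #Population variance sigma^2 divides by n instead
sigma2 <- sum((bb - mean(bb))^2) / n
stopifnot(isTRUE(all.equal(sigma2, var(bb) * (n - 1) / n)))
```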
Refer to figure 11.1.
Skewness is given by equation (11.13), shown here because it has deep meaning.
\[Skew = \frac{\tfrac {1}{n}\sum_{i=1}^{n}(x_{i}-{\overline{x}})^{3}}{\left[\tfrac {1}{n-1}\sum_{i=1}^{n}(x_{i}-{\overline{x}})^{2} \right]^{3/2}} \tag{11.13}\]
Figure 11.1 (Left Tail, Negative) Beta, Normal Distribution, Exponential (Positive, Right Tail)
# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
even_skew <- c(49, 50, 51)
pos_skew <- c(even_skew, 60)
neg_skew <- c(even_skew, 40)
skew_lst <- list(even_skew, pos_skew, neg_skew)
# #Mean, Median, SD
cat(paste0("Mean (even, pos, neg): ",
paste0(vapply(skew_lst, mean, numeric(1)), collapse = ", "), "\n"))
## Mean (even, pos, neg): 50, 52.5, 47.5
cat(paste0("Median (even, pos, neg): ",
paste0(vapply(skew_lst, median, numeric(1)), collapse = ", "), "\n"))
## Median (even, pos, neg): 50, 50.5, 49.5
cat(paste0("SD (even, pos, neg): ", paste0(
round(vapply(skew_lst, sd, numeric(1)), 1), collapse = ", "), "\n"))
## SD (even, pos, neg): 1, 5.1, 5.1
#
cat(paste0("Skewness (even, pos, neg): ", paste0(
round(vapply(skew_lst, e1071::skewness, numeric(1)), 1), collapse = ", "), "\n"))
## Skewness (even, pos, neg): 0, 0.7, -0.7
cat(paste0("Kurtosis (even, pos, neg): ", paste0(
round(vapply(skew_lst, e1071::kurtosis, numeric(1)), 1), collapse = ", "), "\n"))
## Kurtosis (even, pos, neg): -2.3, -1.7, -1.7
# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Skewness: Normal has value close to 0
# #Skewness "e1071" has Type = 3 as default. Its Type = 1 matches "moments"
# #Practically, Normal has (small) NON-Zero Positive Skewness
skew_e_t3 <- vapply(dis_lst, e1071::skewness, numeric(1))
skew_e_t2 <- vapply(dis_lst, e1071::skewness, type = 2, numeric(1))
skew_e_t1 <- vapply(dis_lst, e1071::skewness, type = 1, numeric(1))
skew_mmt <- vapply(dis_lst, moments::skewness, numeric(1))
stopifnot(identical(round(skew_e_t1, 10), round(skew_mmt, 10)))
cat(paste0("e1071: Type = 1 Skewness (Normal, Exp, Beta): ",
paste0(round(skew_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Skewness (Normal, Exp, Beta): 0.0407, 2.0573, -0.6279
cat(paste0("e1071: Type = 2 Skewness (Normal, Exp, Beta): ",
paste0(round(skew_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Skewness (Normal, Exp, Beta): 0.0407, 2.0576, -0.628
cat(paste0("e1071: Type = 3 Skewness (Normal, Exp, Beta): ",
paste0(round(skew_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Skewness (Normal, Exp, Beta): 0.0407, 2.057, -0.6278
#
# #Formula: sum((x_i - mu)^3) / (n * sd^3)
bb <- xxNormal
skew_man <- sum({bb - mean(bb)}^3) / {length(bb) * sd(bb)^3}
cat(paste0("(Manual) Skewness of Normal: ", round(skew_man, 4),
" (vs. e1071 Type 3 = ", round(skew_e_t3[1], 4), ") \n"))
## (Manual) Skewness of Normal: 0.0407 (vs. e1071 Type 3 = 0.0407)
set.seed(3)
nn <- 10000L
# #Normal distribution is symmetrical
xxNormal <- rnorm(n = nn, mean = 0, sd = 1)
# #The exponential distribution is positive skew
xxExp <- rexp(n = nn, rate = 1)
# #The beta distribution with hyper-parameters α=5 and β=2 is negative skew
xxBeta <- rbeta(n = nn, shape1 = 5, shape2 = 2)
#
# #Save
f_setRDS(xxNormal)
f_setRDS(xxExp)
f_setRDS(xxBeta)
#f_getRDS(xxNormal)
# #Get the Distributions
xxNormal <- f_getRDS(xxNormal)
xxExp <- f_getRDS(xxExp)
xxBeta <- f_getRDS(xxBeta)
# #Density Curve
# #Assumes 'hh' has data in 'ee'. In: caption_hh
#Basics
mean_hh <- mean(hh$ee)
sd_hh <- sd(hh$ee)
#
skew_hh <- skewness(hh$ee)
kurt_hh <- kurtosis(hh$ee)
# #Get Quantiles and Ranges of mean +/- sigma
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'),
TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'),
TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_density(alpha = 0.2, colour = "#21908CFF") +
geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') +
geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') +
scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
axis.ticks = element_blank(),
panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
axis.line.y = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank()) +
labs(x = "x", y = "Density",
subtitle = paste0("Mean = ", round(mean_hh, 3), "; SD = ", round(sd_hh, 3), "; Skewness = ", round(skew_hh, 3), "; Kurtosis = ", round(kurt_hh, 3)),
caption = caption_hh, title = title_hh)
}
assign(caption_hh, C03)
rm(C03)
Distributions with zero excess kurtosis are called mesokurtic. The most prominent example of a mesokurtic distribution is the normal distribution. The kurtosis of any univariate normal distribution is 3.
Distributions with kurtosis less than 3 are said to be platykurtic. It means the distribution produces fewer and less extreme outliers than does the normal distribution. An example of a platykurtic distribution is the uniform distribution, which does not produce outliers.
Distributions with kurtosis greater than 3 are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero more slowly than a Gaussian, and therefore produces more outliers than the normal distribution.
Kurtosis is the average (or expected value) of the standardized data raised to the fourth power. Any standardized values that are less than 1 (i.e., data within one standard deviation of the mean, where the “peak” would be), contribute virtually nothing to kurtosis, since raising a number that is less than 1 to the fourth power makes it closer to zero. The only data values that contribute to kurtosis in any meaningful way are those outside the region of the peak; i.e., the outliers. Therefore, kurtosis measures outliers only; it measures nothing about the “peak.”
The sample kurtosis is a useful measure of whether there is a problem with outliers in a data set. Larger kurtosis indicates a more serious outlier problem.
# #Kurtosis Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Kurtosis: Normal has value close to 3 Kurtosis (=0 excess Kurtosis)
# #Kurtosis "e1071" has Type = 3 as default. Its Type = 1 matches "moments" with difference of 3
kurt_e_t3 <- vapply(dis_lst, e1071::kurtosis, numeric(1))
kurt_e_t2 <- vapply(dis_lst, e1071::kurtosis, type = 2, numeric(1))
kurt_e_t1 <- vapply(dis_lst, e1071::kurtosis, type = 1, numeric(1))
kurt_mmt <- vapply(dis_lst, moments::kurtosis, numeric(1))
stopifnot(identical(round(kurt_e_t1, 10), round(kurt_mmt - 3, 10)))
cat(paste0("e1071: Type = 1 Kurtosis (Normal, Exp, Beta): ",
paste0(round(kurt_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Kurtosis (Normal, Exp, Beta): -0.0687, 6.3223, -0.106
cat(paste0("e1071: Type = 2 Kurtosis (Normal, Exp, Beta): ",
paste0(round(kurt_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Kurtosis (Normal, Exp, Beta): -0.0682, 6.326, -0.1055
cat(paste0("e1071: Type = 3 Kurtosis (Normal, Exp, Beta): ",
paste0(round(kurt_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Kurtosis (Normal, Exp, Beta): -0.0693, 6.3204, -0.1066
#
# #Formula: sum((x_i - mu)^4) / (n * sd^4) - 3 (excess kurtosis)
bb <- xxNormal
kurt_man <- {sum({bb - mean(bb)}^4) / {length(bb) * sd(bb)^4}} - 3
cat(paste0("(Manual) Kurtosis of Normal: ", round(kurt_man, 4),
" (vs. e1071 Type 3 = ", round(kurt_e_t3[1], 4), ") \n"))
## (Manual) Kurtosis of Normal: -0.0693 (vs. e1071 Type 3 = -0.0693)
Measures of relative location help us determine how far a particular value is from the mean. By using both the mean and standard deviation, we can determine the relative location of any observation.
\[z_i = \frac{{x}_i - {\overline{x}}}{{s}} \tag{11.14}\]
NOTE: The “Z statistic” is a special case of the “Z score” because \(\sigma/\sqrt{n}\) is the ‘standard error of the sample mean’, which is itself a standard deviation: rather than, e.g., a known population standard deviation or even just a sample standard deviation, it is, per the CLT, the standard deviation of the sample mean. The Z score (standard score) is something that can always be computed (the general case) whenever there is a mean and a standard deviation; it translates X into a Z variable with zero mean and unit variance (it “imposes normality” even when the data may not be normal!). The “Z statistic” standardizes in the same way, as the special case where the quantity being standardized is the sample mean.
Caution:
xxflights <- f_getRDS(xxflights)
bb <- na.omit(xxflights$air_time)
# Scaling
ii <- {bb - mean(bb)} / sd(bb)
str(ii)
## num [1:327346] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
## - attr(*, "na.action")= 'omit' int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
# #scale() gives a Matrix with original mean and sd as its attribute
jj <- scale(bb)
str(jj)
## num [1:327346, 1] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
## - attr(*, "scaled:center")= num 151
## - attr(*, "scaled:scale")= num 93.7
stopifnot(identical(as.vector(ii), as.vector(jj)))
#
hh <- tibble(ee = as.vector(jj))
title_hh <- "Flights: Air Time (Scaled)"
caption_hh <- "C03P08"
Figure 11.2 Before and After Scaling
# #hh$ee title_hh caption_hh
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
annotate(geom = "text", x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character"),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) +
coord_cartesian(ylim = c(0, 35000)) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(Mean= ", round(mean(.[[1]]), 3),
"; SD= ", round(sd(.[[1]]), 3),
")"),
caption = caption_hh, title = title_hh)
}
assign(caption_hh, C03)
rm(C03)
if(FALSE){
# #check_overlap = TRUE works for de-blurring. However, it still checks each point thus slow
geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"),
x = mean(.data[["ee"]]), y = -Inf),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE, check_overlap = TRUE)
# #Create your own dataset
geom_text(data = tibble(x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character")),
aes(x = x, y = y, label = label),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE )
# #Or Equivalent
ggplot2::annotate(geom = "text", x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character"),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE)
#
ggpp::annotate(geom = "text", x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character"),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE)
}
# #List All Colour Names in R
str(colors())
## chr [1:657] "white" "aliceblue" "antiquewhite" "antiquewhite1" "antiquewhite2" "antiquewhite3" ...
# #Packages: viridis, scales, viridisLite
# #Show N Colours with Max. Contrast
q_colors <- 5
# #Display Colours
if(FALSE) show_col(viridis_pal()(q_colors))
# #Get the Viridis i.e. "D" palette Hex Values for N Colours
v_colors <- viridis(q_colors, option = "D")
v_colors
## [1] "#440154FF" "#3B528BFF" "#21908CFF" "#5DC863FF" "#FDE725FF"
The Five-Number Summary is used to quickly summarise a dataset: Min, Q1, Median, Q3, Max.
Figure 11.3 geom_boxplot()
# #nycflights13::weather
bb <- weather
# #NA are present in the data
summary(bb$temp)
#
# #BoxPlot
C03P01 <- bb %>% drop_na(temp) %>% mutate(month = factor(month, ordered = TRUE)) %>% {
ggplot(data = ., mapping = aes(x = month, y = temp)) +
#geom_violin() +
geom_boxplot(aes(fill = month), outlier.colour = 'red', notch = TRUE) +
stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") +
scale_y_continuous(breaks = seq(0, 110, 10), limits = c(0, 110)) +
#geom_point() +
#geom_jitter(position=position_jitter(0.2)) +
k_gglayer_box +
theme(legend.position = 'none') +
labs(x = "Months", y = "Temperature", subtitle = "With Mean & Notch",
caption = "C03P01", title = "BoxPlot")
}
\[\begin{align} \sigma_{xy} &= \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{n} \\ s_{xy} &= \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{n-1} \end{align} \tag{11.15}\]
Figure 11.4 Scatter Plot Quadrants for Covariance
# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>%
mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY)
#
# #Sample Covariance
sxy <- sum(ii$devXY) / {length(ii$devXY) -1}
print(sxy)
## [1] 11
bb <- f_getRDS(xxCommercials)
# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot, Trendline Equation, R2, mean x & y
C03P02 <- bb %>% {
ggplot(data = ., aes(x = Commercials, y = Sales)) +
geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")),
formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
eq.x.rhs = "~italic(x)", parse = TRUE) +
geom_vline(aes(xintercept = round(mean(Commercials), 3)), color = 'red', linetype = "dashed") +
geom_hline(aes(yintercept = round(mean(Sales), 3)), color = 'red', linetype = "dashed") +
geom_text(aes(label = TeX(r"($\bar{x} = 3$)", output = "character"),
x = round(mean(Commercials), 3), y = -Inf),
color = 'red', hjust = -0.2, vjust = -0.5, parse = TRUE, check_overlap = TRUE) +
geom_text(aes(label = TeX(r"($\bar{y} = 51$)", output = "character"),
x = Inf, y = round(mean(Sales), 3)),
color = 'red', hjust = 1.5, vjust = -0.5, parse = TRUE, check_overlap = TRUE) +
geom_point() +
k_gglayer_scatter +
labs(x = "Commercials", y = "Sales ($100s)",
subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)"),
caption = "C03P02", title = "Scatter Plot")
}
\[\begin{align} \rho_{xy} &= \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}} \\ r_{xy} &= \frac{s_{xy}}{s_{x}s_{y}} \end{align} \tag{11.16}\]
# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>%
mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY)
#
# #Sample Covariance
sxy <- sum(ii$devXY) / {length(ii$devXY) -1}
print(sxy)
## [1] 11
jj <- ii %>% mutate(devXsq = devX * devX, devYsq = devY * devY)
# #Sample Covariance Sxy, Sample Standard Deviations Sx, Sy
sxy <- sum(ii$devXY) / {nrow(ii) -1}
sx <- round(sqrt(sum(jj$devXsq) / {nrow(jj) -1}), 2)
sy <- round(sqrt(sum(jj$devYsq) / {nrow(jj) -1}), 2)
cat(paste0("Sxy =", sxy, ", Sx =", sx, ", Sy =", sy, "\n"))
## Sxy =11, Sx =1.49, Sy =7.93
#
# #Correlation Coefficient Rxy
rxy <- round(sxy / {sx * sy}, 2)
cat(paste0("Correlation Coefficient Rxy =", rxy, "\n"))
## Correlation Coefficient Rxy =0.93
| Week | Commercials | Sales | devX | devY | devXY | devXsq | devYsq |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 50 | -1 | -1 | 1 | 1 | 1 |
| 2 | 5 | 57 | 2 | 6 | 12 | 4 | 36 |
| 3 | 1 | 41 | -2 | -10 | 20 | 4 | 100 |
| 4 | 3 | 54 | 0 | 3 | 0 | 0 | 9 |
| 5 | 4 | 54 | 1 | 3 | 3 | 1 | 9 |
| 6 | 1 | 38 | -2 | -13 | 26 | 4 | 169 |
| 7 | 5 | 63 | 2 | 12 | 24 | 4 | 144 |
| 8 | 3 | 48 | 0 | -3 | 0 | 0 | 9 |
| 9 | 4 | 59 | 1 | 8 | 8 | 1 | 64 |
| 10 | 2 | 46 | -1 | -5 | 5 | 1 | 25 |
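The same numbers can be cross-checked against the built-ins cov() and cor(), which implement the sample formulas above:

```r
# #Data from the table above
Commercials <- c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2)
Sales <- c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46)
sxy <- cov(Commercials, Sales)   # #sample covariance, (n - 1) denominator
rxy <- cor(Commercials, Sales)   # #Pearson correlation coefficient
stopifnot(isTRUE(all.equal(sxy, 11)))
stopifnot(round(rxy, 2) == 0.93)
# #Identity from equation (11.16): r = s_xy / (s_x * s_y)
stopifnot(isTRUE(all.equal(rxy, sxy / (sd(Commercials) * sd(Sales)))))
```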
\[n! = \prod _{i=1}^n i = n \cdot(n-1)\cdot(n-2)\cdot(n-3)\cdot\cdots \cdot 3 \cdot 2 \cdot 1 \tag{12.1}\]
\[C_k^N = \binom{N}{k} = \frac{N!}{k!(N-k)!} \tag{12.2}\]
\[P_k^N = k! \binom{N}{k} = \frac{N!}{(N-k)!} \tag{12.3}\]
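Equations (12.1)-(12.3) map onto factorial() and choose(); base R has no dedicated permutation function, so k! * choose(N, k) does the job. A sketch with N = 5, k = 2:

```r
N <- 5L; k <- 2L
stopifnot(factorial(N) == 120)    # #factorial, equation (12.1)
stopifnot(choose(N, k) == 10)     # #combinations, equation (12.2)
stopifnot(choose(N, k) == factorial(N) / (factorial(k) * factorial(N - k)))
# #Permutations, equation (12.3): k! * C(N, k) = N! / (N - k)!
perms <- factorial(k) * choose(N, k)
stopifnot(perms == 20, perms == factorial(N) / factorial(N - k))
```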
\[P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{12.4}\]
\[\begin{align} P(A \cap B) &= P(B) \cdot P(A | B) \\ &= P(A) \cdot P(B | A) \end{align} \tag{12.5}\]
Often, we begin the analysis with initial or prior probability estimates for specific events of interest. Then, from sources such as a sample, a special report, or a product test, we obtain additional information about the events. Given this new information, we update the prior probability values by calculating revised probabilities, referred to as posterior probabilities. Bayes theorem provides a means for making these probability calculations.
\[\begin{align} P(A_{1}|B) &= \frac{P(A_{1})P(B|A_{1})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \\ P(A_{2}|B) &= \frac{P(A_{2})P(B|A_{2})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \end{align} \tag{12.6}\]
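A numeric sketch of equation (12.6); the priors and likelihoods below are made-up values for illustration:

```r
pA <- c(0.65, 0.35)    # #hypothetical priors P(A1), P(A2)
pB_A <- c(0.02, 0.05)  # #hypothetical likelihoods P(B|A1), P(B|A2)
# #Joint probabilities P(Ai)P(B|Ai); their sum is the denominator P(B)
joint <- pA * pB_A
posterior <- joint / sum(joint)   # #Bayes theorem, equation (12.6)
stopifnot(isTRUE(all.equal(sum(posterior), 1)))
round(posterior, 4)
## [1] 0.4262 0.5738
```

Even though A1 has the higher prior, the higher likelihood of B under A2 makes A2 the more probable event after observing B.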
9.13 Quantitative data that measure ‘how many’ are discrete.
9.14 Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.
| \({x}\) | \(f(x)\) | \(xf(x)\) | \((x - \mu)\) | \((x - \mu)^2\) | \((x - \mu)^{2}f(x)\) |
|---|---|---|---|---|---|
| 0 | 0.18 | 0 | -1.5 | 2.25 | 0.405 |
| 1 | 0.39 | 0.39 | -0.5 | 0.25 | 0.0975 |
| 2 | 0.24 | 0.48 | 0.5 | 0.25 | 0.06 |
| 3 | 0.14 | 0.42 | 1.5 | 2.25 | 0.315 |
| 4 | 0.04 | 0.16 | 2.5 | 6.25 | 0.25 |
| 5 | 0.01 | 0.05 | 3.5 | 12.25 | 0.1225 |
| Total | 1.00 | mu = 1.5 | NA | NA | sigma^2 = 1.25 |
# #Dicarlo: Days with Number of Cars Sold per day for last 300 days
xxdicarlo <- tibble(Cars = 0:5, Days = c(54, 117, 72, 42, 12, 3))
#
bb <- xxdicarlo
bb <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>%
mutate(xFx = x * Fx, x_mu = x - sum(xFx),
x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx)
R_dicarlo_var_y_C05 <- sum(bb$x_mu_sq_Fx)
# #Total Row
bb <- bb %>%
mutate(across(1, as.character)) %>%
add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>%
mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx)) %>%
mutate(across(4:5, ~ replace(., x == "Total", NA)))
# #Change Column Classes as required
bb %>% mutate(across(1, as.character))
bb %>% mutate(across(everything(), as.character))
bb <- xxdicarlo
ii <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>%
mutate(xFx = x * Fx, x_mu = x - sum(xFx),
x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx)
# #Add Total Row
ii <- ii %>%
mutate(across(1, as.character)) %>%
add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum)))
#
# #Modify Specific Row Values without using filter()
# #filter() does not have 'un-filter()' function like group()-ungroup() combination
# #Selecting Row where x = "Total" and changing Column Values for Two Columns
ii <- ii %>%
mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx))
#
# #Selecting Row where x = "Total" and doing same replacement on Two Columns
ii %>% mutate(across(4:5, function(y) replace(y, x == "Total", NA)))
ii %>% mutate(across(4:5, ~ replace(., x == "Total", NA)))
bb <- xxdicarlo_gs
# #Assuming there is NO Total Column NOR Total Row and First Column is character
kk <- bb %>% summarise(across(where(is.numeric), sum)) %>% summarise(sum(.)) %>% pull(.)
ll <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
stopifnot(identical(kk, ll))
print(kk)
## [1] 300
bb <- xxdicarlo_gs
# #Round off values to 1 significant digit, e.g. 0.003 or 0.02
# #NOTE: This changes the columns to "character"
bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>%
mutate(across(where(is.numeric), format, digits = 1))
## # A tibble: 4 x 7
## Geneva_Saratoga y0 y1 y2 y3 y4 y5
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 x0 0.07 0.10 0.08 0.03 0.007 0.000
## 2 x1 0.07 0.12 0.11 0.06 0.007 0.003
## 3 x2 0.03 0.14 0.03 0.04 0.010 0.007
## 4 x3 0.01 0.03 0.02 0.01 0.017 0.000
| \(ID\) | \({s}\) | \(f(s)\) | \(sf(s)\) | \((s - E(s))\) | \((s - E(s))^2\) | \((s - E(s))^{2}f(s)\) |
|---|---|---|---|---|---|---|
| A | 0 | 0.070 | 0.00 | -2.64 | 6.99 | 0.489 |
| B | 1 | 0.170 | 0.17 | -1.64 | 2.70 | 0.459 |
| C | 2 | 0.230 | 0.46 | -0.64 | 0.41 | 0.095 |
| D | 3 | 0.290 | 0.87 | 0.36 | 0.13 | 0.037 |
| E | 4 | 0.127 | 0.51 | 1.36 | 1.84 | 0.233 |
| F | 5 | 0.067 | 0.33 | 2.36 | 5.55 | 0.370 |
| G | 6 | 0.023 | 0.14 | 3.36 | 11.27 | 0.263 |
| H | 7 | 0.023 | 0.16 | 4.36 | 18.98 | 0.443 |
| I | 8 | 0.000 | 0.00 | 5.36 | 28.69 | 0.000 |
| Total | NA | 1.000 | E(s) = 2.64 | NA | NA | Var(s) = 2.389 |
bb <- xxdicarlo_gs
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
# #Convert to Bivariate Probability Distribution
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>%
relocate(Fs, .after = last_col()) %>%
mutate(sFs = s * Fs, s_Es = s - sum(sFs),
s_Es_sq = s_Es * s_Es, s_Es_sq_Fs = s_Es_sq * Fs)
# #Save for Notebook
R_dicarlo_var_s_C05 <- sum(kk$s_Es_sq_Fs)
# #For Printing
ll <- kk %>%
add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>%
mutate(across(where(is.numeric), format, digits =2)) %>%
mutate(sFs = ifelse(ID == "Total", paste0("E(s) = ", sFs), sFs),
s_Es_sq_Fs = ifelse(ID == "Total", paste0("Var(s) = ", s_Es_sq_Fs), s_Es_sq_Fs)) %>%
mutate(across(c(2, 5, 6), ~ replace(., ID == "Total", NA)))
bb <- xxdicarlo_gs
# #From the Bivariate get the original data
ii <- bb %>%
mutate(Fx = rowSums(across(where(is.numeric)))) %>%
select(1, 8) %>%
separate(col = Geneva_Saratoga, into = c(NA, "x"), sep = 1) %>%
mutate(across(1, as.integer))
# #Variance Calculation
jj <- ii %>% mutate(across(Fx, ~./sum(Fx))) %>%
mutate(xFx = x * Fx, x_mu = x - sum(xFx),
x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx)
# #Save for Notebook
R_dicarlo_var_x_C05 <- sum(jj$x_mu_sq_Fx)
print(jj)
## # A tibble: 4 x 6
## x Fx xFx x_mu x_mu_sq x_mu_sq_Fx
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.287 0 -1.14 1.31 0.375
## 2 1 0.37 0.37 -0.143 0.0205 0.00760
## 3 2 0.257 0.513 0.857 0.734 0.188
## 4 3 0.0867 0.26 1.86 3.45 0.299
bb <- xxdicarlo_gs
#
# #Tibble Total SUM
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
#
# #Convert to Bivariate Probability Distribution and Exclude First Character Column
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
#
# #(1A, 2B, 3C, 4D, 4E, 4F, 3G, 2H, 1I) 9 Unique Combinations = 24 (4x6) Experimental Outcomes
matrix(data = LETTERS[row(ii) + col(ii)-1], nrow = 4)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "A" "B" "C" "D" "E" "F"
## [2,] "B" "C" "D" "E" "F" "G"
## [3,] "C" "D" "E" "F" "G" "H"
## [4,] "D" "E" "F" "G" "H" "I"
#
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
print(jj)
## A B C D E F G H I
## 0.07000000 0.17000000 0.23000000 0.29000000 0.12666667 0.06666667 0.02333333 0.02333333 0.00000000
# #In place of LETTERS, Numerical Index can also be used but Letters are more clear for grouping
#tapply(X= as.matrix(ii), INDEX = c(0:8)[row(ii) + col(ii)-1], FUN = sum)
#
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>%
relocate(Fs, .after = last_col())
print(kk)
## # A tibble: 9 x 3
## ID s Fs
## <chr> <dbl> <dbl>
## 1 A 0 0.07
## 2 B 1 0.17
## 3 C 2 0.23
## 4 D 3 0.29
## 5 E 4 0.127
## 6 F 5 0.0667
## 7 G 6 0.0233
## 8 H 7 0.0233
## 9 I 8 0
bb <- xxdicarlo_gs
# #Separate String based on Position
bb %>% separate(col = Geneva_Saratoga, into = c("A", "B"), sep = 1)
## # A tibble: 4 x 8
## A B y0 y1 y2 y3 y4 y5
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x 0 21 30 24 9 2 0
## 2 x 1 21 36 33 18 2 1
## 3 x 2 9 42 9 12 3 2
## 4 x 3 3 9 6 3 5 0
\[\sigma_{xy} = \frac{\text{Var}(x + y) - \text{Var}(x) - \text{Var}(y)}{2} \tag{13.1}\]
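Equation (13.1) can be verified numerically: with R's sample versions of var() and cov(), the identity holds exactly. A minimal sketch, where x and y are arbitrary toy vectors made up for illustration:

```r
# #Verify sigma_xy = (Var(x + y) - Var(x) - Var(y)) / 2 on toy data
x_ii <- c(0, 1, 2, 3, 1, 2)
y_ii <- c(5, 4, 4, 2, 3, 1)
cov_from_var_ii <- (var(x_ii + y_ii) - var(x_ii) - var(y_ii)) / 2
stopifnot(all.equal(cov_from_var_ii, cov(x_ii, y_ii)))
```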
dbinom(), pbinom(), qbinom(), rbinom()
dpois(), ppois(), qpois(), rpois()
13.2 A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. This includes encoded factor variables, e.g. Male as 0 and Female as 1.
13.3 A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.
\[\begin{align} E(x) &= \frac{a + b}{2} \\ \text{Var}(x) &= \frac{(b - a)^2}{12} \end{align} \tag{14.1}\]
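As a quick check of equation (14.1), the moments of a large simulated uniform sample converge to these expressions; a = 2 and b = 10 below are arbitrary illustration values:

```r
# #E(x) = (a + b)/2 and Var(x) = (b - a)^2/12 for Uniform(a, b)
a_ii <- 2; b_ii <- 10
set.seed(123)
x_ii <- runif(n = 1e6, min = a_ii, max = b_ii)
# #Simulated vs theoretical moments
c(mean = mean(x_ii), E_x = (a_ii + b_ii)/2, var = var(x_ii), Var_x = (b_ii - a_ii)^2/12)
```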
\[f(x) = {\frac {1}{{\sigma}{\sqrt {2 \pi}}}} e^{-{\frac {1}{2}}\left( {\frac {x-{\mu} }{\sigma}}\right) ^{2}} \tag{14.2}\]
Figure 14.1 Normal Distribution
# #Histogram with Density Curve, Mean and Median: Normal Distribution
ee <- f_getRDS(xxNormal)
hh <- tibble(ee)
ee <- NULL
# #Basics
median_hh <- round(median(hh[[1]]), 3)
mean_hh <- round(mean(hh[[1]]), 3)
sd_hh <- round(sd(hh[[1]]), 3)
len_hh <- nrow(hh)
#
# #Base Plot: Creates Only Density Function Line
ii <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + geom_density() }
#
# #Change the line colour and alpha
ii <- ii + geom_density(alpha = 0.2, colour = "#21908CFF")
#
# #Add Histogram with 50 bins, alpha and fill
ii <- ii + geom_histogram(aes(y = ..density..), bins = 50, alpha = 0.4, fill = '#FDE725FF')
#
# #Full Vertical Line at Mean. Goes across Function Boundary on Y-Axis
#ii <- ii + geom_vline(aes(xintercept = mean_hh), color = '#440154FF')
#
# #Shaded Area Object for Line/Area up to the Function Boundary on Y-Axis
# #Mean
ii_mean <- ggplot_build(ii)$data[[1]] %>% filter(x <= mean_hh)
# #Median
ii_median <- ggplot_build(ii)$data[[1]] %>% filter(x <= median_hh)
#
# #To show values which are less than Mean in colour
#ii <- ii + geom_area(data = ii_mean, aes(x = x, y = y), fill = 'blue', alpha = 0.5)
#
# #Line upto the Density Curve at Mean
ii <- ii + geom_segment(data = ii_mean,
aes(x = mean_hh, y = 0, xend = mean_hh, yend = density), color = "#440154FF")
#
# #Label 'Mean'
ii <- ii + geom_text(aes(label = paste0("Mean= ", mean_hh), x = mean_hh, y = -Inf),
color = '#440154FF', hjust = -0.5, vjust = -1, angle = 90, check_overlap = TRUE)
#
# #Similarly, Median Line and Label
ii <- ii + geom_segment(data = ii_median,
aes(x = median_hh, y = 0, xend = median_hh, yend = density), color = "#3B528BFF") +
geom_text(aes(label = paste0("Median= ", median_hh), x = median_hh, y = -Inf),
color = '#3B528BFF', hjust = -0.4, vjust = 1.2, angle = 90, check_overlap = TRUE)
#
# #Change Axis Limits
ii <- ii + coord_cartesian(xlim = c(-5, 5), ylim = c(0, 0.5))
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
# # Latex Labels
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'),
TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'),
TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
ii <- ii + scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh)
#
# #Get Quantiles and Ranges of mean +/- sigma
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 3 * sd_hh)
sig3r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 3 * sd_hh)
sig2r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 2 * sd_hh, x < mean_hh + 3 * sd_hh)
sig2l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 2 * sd_hh, x > mean_hh - 3 * sd_hh)
sig1r_hh <- density_hh_tbl %>% filter(x >= mean_hh + sd_hh, x < mean_hh + 2 * sd_hh)
sig1l_hh <- density_hh_tbl %>% filter(x <= mean_hh - sd_hh, x > mean_hh - 2 * sd_hh)
#
# #Use (mean +/- 3 sigma) To Highlight. NOT ALL Zones have been highlighted
ii <- ii + geom_area(data = sig3l_hh, aes(x = x, y = y), fill = 'red') +
geom_area(data = sig3r_hh, aes(x = x, y = y), fill = 'red')
#
# #Annotate Arrows
ii <- ii +
# ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[3], y = 0.42,
# yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
# ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[2], y = 0.45,
# yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[1], y = 0.48,
yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
# ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[5], y = 0.42,
# yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
# ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[6], y = 0.45,
# yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[7], y = 0.48,
yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc")))
#
# #Annotate Labels
ii <- ii +
# ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.42, label = "68.3%") +
# ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.45, label = "95.4%") +
ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.48, label = "99.7%")
#
# #Add a Theme and adjust Position of Title & Subtitle (Both by plot.title.position) & Caption
# #"plot" or "panel"
ii <- ii + theme(#plot.tag.position = "topleft",
#plot.caption.position = "plot",
#plot.caption = element_text(hjust = 0),
plot.title.position = "panel")
#
# #Title, Subtitle, Caption, Axis Labels, Tag
ii <- ii + labs(x = "x", y = "Density",
subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh,
"; Median= ", median_hh, "; SD= ", sd_hh),
caption = "C06AA", tag = NULL,
title = "Normal Distribution (Symmetrical)")
#
#ii
# #Syntax
#latex2exp::TeX(r'($\sigma = 10$)', output = "character")
# #Test Equation
plot(TeX(r'(abc: $\frac{2hc^2}{\lambda^5} \, \frac{1}{e^{\frac{hc}{\lambda k_B T}} - 1}$)'), cex=2)
plot(TeX(r'(xyz: $f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \,\left(\frac{x - \mu}{\sigma}\right)^2} $)'), cex=2)
# #Syntax
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($\sigma =10$)', output = "character"), parse = TRUE, check_overlap = TRUE)
# #NOTE: Complex Equations like the Normal Distribution are crashing R.
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \, \left(\frac{x - \mu}{\sigma}\right)^2} $)', output = "character"), parse = TRUE, check_overlap = TRUE)
# #Data
bb <- f_getRDS(xxNormal)
hh <- tibble(bb)
# #Base Plot
ii <- hh %>% { ggplot(data = ., mapping = aes(x = bb)) + geom_density() }
# #Attributes
attributes(ggplot_build(ii))$names
## [1] "data" "layout" "plot"
#
str(ggplot_build(ii)$data[[1]])
## 'data.frame': 512 obs. of 18 variables:
## $ y : num 0.000504 0.00052 0.000532 0.000541 0.000545 ...
## $ x : num -3.63 -3.61 -3.6 -3.58 -3.57 ...
## $ density : num 0.000504 0.00052 0.000532 0.000541 0.000545 ...
## $ scaled : num 0.00126 0.0013 0.00133 0.00136 0.00137 ...
## $ ndensity : num 0.00126 0.0013 0.00133 0.00136 0.00137 ...
## $ count : num 5.04 5.2 5.32 5.41 5.45 ...
## $ n : int 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
## $ flipped_aes: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ PANEL : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
## $ group : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ ymin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ymax : num 0.000504 0.00052 0.000532 0.000541 0.000545 ...
## $ fill : logi NA NA NA NA NA NA ...
## $ weight : num 1 1 1 1 1 1 1 1 1 1 ...
## $ colour : chr "black" "black" "black" "black" ...
## $ alpha : logi NA NA NA NA NA NA ...
## $ size : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ linetype : num 1 1 1 1 1 1 1 1 1 1 ...
“.[[1]] is discouraged. Use .data[[1]] instead.”
STOP! STOP! Just STOP! using UNICODE for R Console on WINDOWS (UTF-8 Issue).
\[f(z) = \varphi(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}} \tag{14.3}\]
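Equation (14.3) is exactly what dnorm() evaluates with its defaults mean = 0, sd = 1:

```r
# #phi(z) from equation (14.3) matches dnorm() pointwise
z_ii <- c(-2, -1, 0, 1, 2)
phi_ii <- exp(-z_ii^2 / 2) / sqrt(2 * pi)
stopifnot(all.equal(phi_ii, dnorm(z_ii)))
# #The peak at z = 0 is 1/sqrt(2*pi), about 0.3989
```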
11.17 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.
\({z} \in \mathbb{R} \iff P_{(z)} \in (0, 1)\)
# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
pnorm(q = 1.00)
## [1] 0.8413447
# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
# #pnorm(q = 1.00) #default 'lower.tail = TRUE'
z_ii <- 1.00
p_ii <- round(pnorm(q = z_ii, lower.tail = TRUE), 4)
cat(paste0("P(z <= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z <= 1.000) = 0.8413
#
# #Probability that z is in the interval between −.50 and 1.25 #0.5859
z_min_ii <- -0.50
z_max_ii <- 1.25
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ",
format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-0.500 <= z <= 1.250) = 0.5858
#
# #Probability of obtaining a z value of at least 1.58 #0.0571
z_ii <- 1.58
p_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
cat(paste0("P(z >= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z >= 1.580) = 0.0571
#
# #Probability that the z is within one standard deviation of the mean i.e. [-1, 1] #0.6826
z_min_ii <- -1.00
z_max_ii <- 1.00
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ",
format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-1.000 <= z <= 1.000) = 0.6827
# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards Right is 0.10
qnorm(p = 1 - 0.10)
## [1] 1.281552
qnorm(p = 0.10, lower.tail = FALSE)
## [1] 1.281552
# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards Right is 0.10 i.e. right >10%
#qnorm(p = 1 - 0.10)
#qnorm(p = 0.10, lower.tail = FALSE)
p_r_ii <- 0.10
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
z_jj <- round(qnorm(p = p_r_ii, lower.tail = FALSE), 4)
stopifnot(identical(z_ii, z_jj))
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ",
format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.900 (i.e. (Right) 1-P(z) = 0.100) at z = 1.2816
\[z = \frac{x - {\mu}}{{\sigma}} \tag{14.4}\]
Reasons to convert normal distributions into the standard normal distribution:
Each z-score is associated with a probability, or p-value, that gives the likelihood of values below that z-score occurring. By converting an individual value into a z-score, we can find the probability of all values up to that value occurring in a normal distribution.
The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.
The area under the curve to the right of a z-score is the p-value, and it’s the likelihood of your observation occurring if the null hypothesis is true.
Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.
# #For N(mu =10, sd =2) Probability that X is in [10, 14]
# #Same as P(0 <= z <= 2)
mu_ii <- 10
sd_ii <- 2
x_min_ii <- 10
x_max_ii <- 14
#
z_min_ii <- (x_min_ii - mu_ii) /sd_ii #0
z_max_ii <- (x_max_ii - mu_ii) /sd_ii #2
#
pz_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
# #OR
px_ii <- round(pnorm(q = x_max_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE) -
pnorm(q = x_min_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(identical(pz_ii, px_ii))
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ",
format(z_max_ii, nsmall = 3), ") = ", pz_ii, "\n"))
## P(0.000 <= z <= 2.000) = 0.4772
cat(paste0("P(", x_min_ii, " <= x <= ", x_max_ii, ") = ", format(px_ii, nsmall = 3), "\n"))
## P(10 <= x <= 14) = 0.4772
# #Grear Tire N(mu = 36500, sd =5000)
# #Probability that the tire mileage, x, will exceed 40000 # 24.2% Tires
mu_ii <- 36500
sd_ii <- 5000
x_ii <- 40000
#
z_ii <- (x_ii - mu_ii)/sd_ii
#
#pnorm(q = 40000, mean = 36500, sd = 5000, lower.tail = FALSE)
pz_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
px_ii <- round(pnorm(q = x_ii, mean = mu_ii, sd = sd_ii, lower.tail = FALSE), 4)
stopifnot(identical(px_ii, pz_ii))
#
cat(paste0("P(x >= ", x_ii, ") = ", format(px_ii, nsmall = 4), " (",
round(100* px_ii, 2), "%)\n"))
## P(x >= 40000) = 0.2420 (24.2%)
#
# #What should the guarantee mileage be if no more than 10% of the tires to be eligible
# #for the discount guarantee i.e. left <10% # ~30100 miles
p_l_ii <- 0.10
p_r_ii <- 1 - p_l_ii
#
#qnorm(p = 0.10, mean = 36500, sd = 5000)
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
xz_ii <- z_ii * sd_ii + mu_ii
#
x_ii <- round(qnorm(p = p_l_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(abs(xz_ii - x_ii) < 1)
cat(paste0("(Left) P(x) = ", p_l_ii, " (i.e. (Right) 1-P(z) = ", p_r_ii,
") at x = ", round(x_ii, 1), "\n"))
## (Left) P(x) = 0.1 (i.e. (Right) 1-P(z) = 0.9) at x = 30092.2
9.2 Elements are the entities on which data are collected. (Generally ROWS)
9.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)
9.20 A population is the set of all elements of interest in a particular study.
9.21 A sample is a subset of the population.
9.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.
The sample contains only a portion of the population. Some sampling error is to be expected. So, the sample results provide only estimates of the values of the corresponding population characteristics.
Random sample vs. SRS
Suppose, from a Population, we take a sample of size \({n}\) and calculate the point estimate mean \(\overline{x}_{1}\). Further, we can select another random sample from the Population and get another point estimate mean \(\overline{x}_{2}\). If we repeat this process 500 times, we will have a set of point estimates \(\{\overline{x}_{1}, \overline{x}_{2}, \ldots, \overline{x}_{500}\}\).
If we consider the process of selecting a simple random sample as an experiment, the sample mean \({\overline{x}}\) is the numerical description of the outcome of the experiment. Thus, the sample mean \({\overline{x}}\) is a random variable. As a result, just like other random variables, \({\overline{x}}\) has a mean or expected value, a standard deviation, and a probability distribution. Because the various possible values of \({\overline{x}}\) are the result of different simple random samples, the probability distribution of \({\overline{x}}\) is called the sampling distribution of \({\overline{x}}\). Knowledge of this sampling distribution and its properties will enable us to make probability statements about how close the sample mean \({\overline{x}}\) is to the population mean \({\mu}\).
Just as with other probability distributions, the sampling distribution of \({\overline{x}}\) has an expected value or mean, a standard deviation, and a characteristic shape or form.
\[\begin{align} \text{Finite Population:} \sigma_{\overline{x}} &= \sqrt{\frac{N - n}{N-1}}\left(\frac{\sigma}{\sqrt{n}} \right) \\ \text{Infinite Population:} \sigma_{\overline{x}} &= \frac{\sigma}{\sqrt{n}} \end{align} \tag{15.1}\]
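Equation (15.1) in code; the figures below (N = 2500, sigma = 4000, n = 30) are assumed for illustration. When n/N is small, the finite population correction factor is close to 1 and the two expressions nearly agree:

```r
# #Standard error of the mean: finite vs infinite population (Eq. 15.1)
# #N = 2500, sigma = 4000, n = 30 are assumed illustration values
N_ii <- 2500; sigma_ii <- 4000; n_ii <- 30
se_inf_ii <- sigma_ii / sqrt(n_ii)
fpc_ii <- sqrt((N_ii - n_ii) / (N_ii - 1))
se_fin_ii <- fpc_ii * se_inf_ii
# #With n/N = 0.012 the two standard errors differ by well under 1%
c(se_infinite = se_inf_ii, fpc = fpc_ii, se_finite = se_fin_ii)
```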
Refer Effect of Sample Size and Repeat Sampling
Figure 15.1 Effect of Sample Size vs Repeat Sampling
“ForLater”
If a statistically independent sample of \({n}\) observations \({{x}_1, {x}_2, \ldots, {x}_n}\) is taken from a statistical population with a standard deviation of \(\sigma\), then the mean value calculated from the sample \(\overline{x}\) will have an associated standard error of the mean \(\sigma_\overline{x}\) given by
\[\sigma_\overline{x} = \frac{\sigma}{\sqrt{n}} \tag{15.2}\]
The standard deviation \(\sigma\) of the population being sampled is seldom known. Therefore, \(\sigma_\overline{x}\) is usually estimated by replacing \(\sigma\) with the sample standard deviation \(\sigma_{x}\) instead:
\[\sigma_\overline{x} \approx \frac{\sigma_{x}}{\sqrt{n}} \tag{15.3}\]
As this is only an ‘estimator’ for the true “standard error,” other notations are used, such as:
\[\widehat{\sigma}_\overline{x} = \frac{\sigma_{x}}{\sqrt{n}} \tag{15.4}\]
OR:
\[{s}_\overline{x} = \frac{s}{\sqrt{n}} \tag{15.5}\]
Key:
Non-mathematical view:
Form of the Sampling Distribution of \({\overline{x}}\)
When the population has a normal distribution, the sampling distribution of \({\overline{x}}\) is normally distributed for any sample size.
When the population from which we are selecting a random sample does not have a normal distribution, the central limit theorem is helpful in identifying the shape of the sampling distribution of \({\overline{x}}\).
How large does the sample size need to be before the central limit theorem applies and we can assume that the shape of the sampling distribution is approximately normal? General statistical practice is to assume the sampling distribution of \({\overline{x}}\) is approximately normal whenever the sample size is 30 or more.
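A quick simulation makes the theorem concrete: sample means of n = 30 draws from a strongly skewed exponential population already behave close to normally, with mean and standard error matching theory:

```r
# #CLT sketch: means of n = 30 draws from a skewed Exponential(rate = 1) population
set.seed(42)
xbar_ii <- replicate(5000, mean(rexp(n = 30, rate = 1)))
# #Theory: E(xbar) = 1 and SE(xbar) = 1/sqrt(30), about 0.1826
c(mean = mean(xbar_ii), se = sd(xbar_ii))
```

A histogram of xbar_ii (e.g. via geom_histogram()) shows the familiar bell shape even though the population itself is far from normal.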
Task of developing a profile of 2500 managers. The characteristics to be identified include the mean annual salary for the managers and the proportion of managers having completed a training.
Caution: Here, we took advantage of the fact that the population mean \({\mu}\) and the population standard deviation \({\sigma}\) were known. However, usually these values will be unknown.
Properties of Point Estimators
Three properties of good point estimators: unbiased, efficiency, and consistency.
\(\theta = \text{the population parameter of interest}\) \(\hat{\theta} = \text{the sample statistic or point estimator of } \theta\)
Other Sampling Methods
In order to develop an interval estimate of a population mean, either the population standard deviation \({\sigma}\) or the sample standard deviation \({s}\) must be used to compute the margin of error. In most applications \({\sigma}\) is not known, and \({s}\) is used to compute the margin of error.
In some applications, large amounts of relevant historical data are available and can be used to estimate the population standard deviation prior to sampling. Also, in quality control applications where a process is assumed to be operating correctly, or ‘in control,’ it is appropriate to treat the population standard deviation as known.
Sampling distribution of \({\overline{x}}\) can be used to compute the probability that \({\overline{x}}\) will be within a given distance of \({\mu}\).
Example: Lloyd Department Store
Interval Estimate of a Population Mean: \({\sigma}\) known is given by equation (16.1)
\[\begin{align} \overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \end{align} \tag{16.1}\]
where \((1 − \alpha)\) is the confidence coefficient and \(z_{\alpha/2}\) is the z-value providing an area of \(\alpha/2\) in the upper tail of the standard normal probability distribution.
For a 95% confidence interval, the confidence coefficient is \((1 − \alpha) = 0.95\) and thus, \(\alpha = 0.05\). Using the standard normal probability table, an area of \(\alpha/2 = 0.05/2 = 0.025\) in the upper tail provides \(z_{.025} = 1.96\).
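Putting equation (16.1) together for the Lloyd example; the sample figures below (xbar = 82, sigma = 20, n = 100) are assumed from the usual version of that example:

```r
# #Interval Estimate of a Population Mean: sigma known (Eq. 16.1)
# #xbar = 82, sigma = 20, n = 100 are assumed Lloyd figures
xbar_ii <- 82; sigma_ii <- 20; n_ii <- 100
alpha_ii <- 0.05
z_ii <- qnorm(p = 1 - alpha_ii/2)                 #1.96
err_margin_ii <- z_ii * sigma_ii / sqrt(n_ii)
# #95% confidence interval [78.08, 85.92]
c(lower = xbar_ii - err_margin_ii, upper = xbar_ii + err_margin_ii)
```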
# #Find z-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ",
format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96
#
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
## 10% 5% 5/2% 1% 1/2%
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
## 10% 5% 5/2% 1% 1/2%
## 1.2816 1.6449 1.9600 2.3263 2.5758
The t distribution is a family of similar probability distributions, with a specific t distribution depending on a parameter known as the degrees of freedom. As the number of degrees of freedom increases, the difference between the t distribution and the standard normal distribution becomes smaller and smaller.
Just as \(z_{0.025}\) was used to indicate the z value providing a 0.025 area in the upper tail of a standard normal distribution, \(t_{0.025}\) indicates a 0.025 area in the upper tail of a t distribution. In general, the notation \(t_{\alpha/2}\) represents a t value with an area of \(\alpha/2\) in the upper tail of the t distribution.
As the degrees of freedom increase, the t distribution approaches the standard normal distribution. Ex: \(t_{0.025} = 2.262 \, (\text{DOF} = 9)\), \(t_{0.025} = 2.200 \, (\text{DOF} = 60)\), and \(t_{0.025} = 1.96 \, (\text{DOF} = \infty) = z_{0.025}\)
Interval Estimate of a Population Mean: \({\sigma}\) Unknown is given by equation (16.2)
\[\begin{align} \overline{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \end{align} \tag{16.2}\]
where \({s}\) is the sample standard deviation, \((1 − \alpha)\) is the confidence coefficient and \(t_{\alpha/2}\) is the t-value providing an area of \(\alpha/2\) in the upper tail of the t distribution with \({n-1}\) degrees of freedom.
Refer equation (11.12), the expression for the sample standard deviation is
\[{s} = \sqrt{\frac{\sum \left(x_i - \overline{x}\right)^2}{n-1}}\]
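R's sd() implements exactly this (n − 1) formula:

```r
# #sd() uses the (n - 1) denominator from the formula above
x_ii <- c(2, 4, 4, 4, 5, 5, 7, 9)
s_ii <- sqrt(sum((x_ii - mean(x_ii))^2) / (length(x_ii) - 1))
stopifnot(all.equal(s_ii, sd(x_ii)))
```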
Why \((n-1)\) are the degrees of freedom
Larger sample sizes are needed if the distribution of the population is highly skewed or includes outliers.
# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
# #Find t-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
# #t-tables are unique for different degrees of freedom i.e. for DOF = 9
dof_ii <- 9
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.2622 (dof = 9)
# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
# #Find t-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
# #t-tables are unique for different degrees of freedom i.e. for DOF = 9
dof_ii <- 9
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.2622 (dof = 9)
#
dof_ii <- 60
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.0003 (dof = 60)
#
dof_ii <- 600
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9639 (dof = 600)
#
# #t-tables have an Infinity row which is the same as the z-table. For DOF >100, it can be used.
dof_ii <- Inf
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.96 (dof = Inf)
#
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ",
format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96
# #A sample of n = 70 households provided the credit card balances.
xxCreditCards <- c(9430, 7535, 4078, 5604, 5179, 4416, 10676, 1627, 10112, 6567, 13627, 18719, 14661, 12195, 10544, 13659, 7061, 6245, 13021, 9719, 2200, 10746, 12744, 5742, 7159, 8137, 9467, 12595, 7917, 11346, 12806, 4972, 11356, 7117, 9465, 19263, 9071, 3603, 16804, 13479, 14044, 6817, 6845, 10493, 615, 13627, 12557, 6232, 9691, 11448, 8279, 5649, 11298, 4353, 3467, 6191, 12851, 5337, 8372, 7445, 11032, 6525, 5239, 6195, 12584, 15415, 15917, 12591, 9743, 10324)
f_setRDS(xxCreditCards)
bb <- f_getRDS(xxCreditCards)
mean_bb <- mean(bb)
sd_bb <- sd(bb)
dof_bb <- length(bb) - 1L
# #t-value for confidence interval 95% | (1-alpha) = 0.95 | alpha = 0.05 | alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
dof_ii <- dof_bb
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9949 (dof = 69)
#
# #Interval Estimate
err_margin_bb <- t_ii * sd_bb / sqrt(length(bb))
est_l <- mean_bb - err_margin_bb
est_r <- mean_bb + err_margin_bb
#
cat(paste0("Normal Sample (n=", length(bb), ", mean=", mean_bb, ", sd=", round(sd_bb, 1),
"):\n Point Estimate = ", mean_bb, ", Margin of error = ", round(err_margin_bb, 1),
", ", (1-2*p_r_ii) * 100, "% confidence interval is [",
round(est_l, 1), ", ", round(est_r, 1), "]"))
## Normal Sample (n=70, mean=9312, sd=4007):
## Point Estimate = 9312, Margin of error = 955.4, 95% confidence interval is [8356.6, 10267.4]
Note:
All hypothesis testing applications involve collecting a sample and using the sample results to provide evidence for drawing a conclusion.
In some situations it is easier to identify the alternative hypothesis first and then develop the null hypothesis.
For hypothesis tests involving a population mean, we let \({\mu}_0\) denote the hypothesized value and we must choose one of the following three forms for the hypothesis test.
The alternative is one-sided if it states that a parameter is larger or smaller than the null value; it is two-sided if it states that the parameter is different from the null value.
Refer Equality in Hypothesis
Refer Type I and Type II Errors (B12)
Ideally the hypothesis testing procedure should lead to the acceptance of \({H_0}\) when \({H_0}\) is true and the rejection of \({H_0}\) when \({H_a}\) is true. Unfortunately, the correct conclusions are not always possible. Because hypothesis tests are based on sample information, we must allow for the possibility of errors.
Figure 17.1 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors
16.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.
In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.
Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Hence, if we decide to accept \({H_0}\), we cannot determine how confident we can be with that decision. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement "do not reject \({H_0}\)" instead of “accept \({H_0}\).” Using the statement “do not reject \({H_0}\)” carries the recommendation to withhold both judgment and action. In effect, by not directly accepting \({H_0}\), the statistician avoids the risk of making a Type II error.
Refer figure 17.1
17.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
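Definition 17.28 can be computed directly for a lower tail z-test; the figures below (mu0 = 3, sigma = 0.18, n = 36, alpha = 0.05, and a true mean of 2.92) are assumed for illustration:

```r
# #Power (1 - beta) of a lower tail z-test at a specific true mean
# #mu0 = 3, sigma = 0.18, n = 36, alpha = 0.05, mu_true = 2.92 are assumed values
mu0_ii <- 3; sigma_ii <- 0.18; n_ii <- 36
alpha_ii <- 0.05; mu_true_ii <- 2.92
se_ii <- sigma_ii / sqrt(n_ii)
# #Reject H0 when xbar falls at or below the critical value
xbar_crit_ii <- mu0_ii + qnorm(p = alpha_ii) * se_ii
power_ii <- pnorm(q = xbar_crit_ii, mean = mu_true_ii, sd = se_ii)
round(power_ii, 4)
```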
13.4 The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
The test statistic summarizes the observed data into a single number using the central tendency, variation, sample size, and number of predictor variables in the statistical model. Refer Table 17.1
| Test statistic | \({H_0}\) and \({H_a}\) | Statistical tests that use it |
|---|---|---|
| t-value | Null: The means of two groups are equal | T-test, Regression tests |
| Alternative: The means of two groups are not equal | ||
| z-value | Null: The means of two groups are equal | Z-test |
| Alternative: The means of two groups are not equal | ||
| F-value | Null: The variation within the groups is greater than or equal to the variation between the groups | ANOVA, ANCOVA, MANOVA |
| Alternative: The variation between the groups is greater than the variation within the groups | ||
| \({\chi}^2\text{-value}\) | Null: Two samples are independent | Chi-squared test, Non-parametric correlation tests |
| Alternative: Two samples are not independent (i.e. they are correlated) |
11.14 A tail refers to the tapering sides at either end of a distribution curve.
One-tailed tests are concerned with one side of a statistic: they deal with only one tail of the distribution, and the z-score is on only one side of the statistic. Two-tailed tests deal with both tails of the distribution, and the z-score is on both sides of the statistic.
In a one-tailed test, the area under the rejection region is equal to the level of significance, \({\alpha}\). When the rejection region is below the acceptance region, we say that it is a left-tail test. Similarly, when the rejection region is above the acceptance region, we say that it is a right-tail test.
In the two-tailed test, there are two critical regions, and the area under each region is \(\frac{\alpha}{2}\).
One-Tail vs. Two-Tail
One-tailed tests about a population mean take one of the following two forms:
17.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)
17.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
Example: The label on a can of Hilltop Coffee states that the can contains 3 pounds of coffee. As long as the population mean filling weight is at least 3 pounds per can, the rights of consumers will be protected. Thus, the government (FTC) interprets the label information on a large can of coffee as a claim by Hilltop that the population mean filling weight is at least 3 pounds per can.
\[z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}} \tag{17.1}\]
The key question for a lower tail test is: How small must the test statistic \({z}\) be before we choose to reject the null hypothesis?
Two approaches can be used to answer this: the p-value approach and the critical value approach.
p-value (p) is the probability of obtaining a result equal to or more extreme than was observed in the data. It is the probability of observing the result given that the null hypothesis is true. A small p-value indicates the value of the test statistic is unusual given the assumption that \({H_0}\) is true.
For a lower tail test, the p-value is the probability of obtaining a value for the test statistic as small as or smaller than that provided by the sample.
- We use the standard normal distribution to find the probability that \({z}\) is less than or equal to the value of the test statistic.
- After computing the p-value, we must then decide whether it is small enough to reject the null hypothesis; this decision involves comparing the p-value to the level of significance.
For the Hilltop Coffee Example
Rejection Rule: Reject \({H_0}\) if p-value \(\leq {\alpha}\)
Further, in this case, we would reject \({H_0}\) for any value of \({\alpha} \geq (p = 0.0038)\). For this reason, the p-value is also called the observed level of significance.
For a lower tail test, the critical value serves as a benchmark for determining whether the value of the test statistic is small enough to reject the null hypothesis.
- The critical value is the value of the test statistic that corresponds to an area of \({\alpha}\) (the level of significance) in the lower tail of the sampling distribution of the test statistic.
- In other words, the critical value is the largest value of the test statistic that will result in the rejection of the null hypothesis.
Hilltop Coffee Example
Rejection Rule: Reject \({H_0}\) if \(z \leq z_{\alpha}\)
The p-value approach to hypothesis testing and the critical value approach will always lead to the same rejection decision; that is, whenever the p-value is less than or equal to \({\alpha}\), the value of the test statistic will be less than or equal to the critical value.
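A minimal R sketch of both approaches for the Hilltop Coffee example, assuming the usual textbook sample values (\(n = 36\), \(\overline{x} = 2.92\), \({\sigma} = 0.18\), \({\alpha} = 0.01\)), which reproduce the p-value of 0.0038 quoted above:

```r
# #Hilltop Coffee lower tail test (sample values assumed: n = 36, xbar = 2.92, sigma = 0.18)
mu0 <- 3; sigma <- 0.18; n <- 36; xbar <- 2.92; alpha <- 0.01
z <- (xbar - mu0) / (sigma / sqrt(n))   # #Test statistic, equation (17.1): -2.67
p_value <- pnorm(q = z)                 # #p-value approach: P(Z <= z) = 0.0038
z_crit <- qnorm(p = alpha)              # #Critical value approach: -2.33
c(z = z, p_value = p_value, z_crit = z_crit)
# #Both approaches agree: z <= z_crit exactly when p_value <= alpha
```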
For an upper tail test, the test statistic \({z}\) is still computed as before, but the p-value is the probability of obtaining a value for the test statistic as large as or larger than that provided by the sample. Thus, to compute the p-value for an upper tail test in the \({\sigma}\) known case, we use the standard normal distribution to find the probability that \({z}\) is greater than or equal to the value of the test statistic. The critical value approach leads us to reject the null hypothesis if the value of the test statistic is greater than or equal to the critical value \(z_{\alpha}\); in other words, we reject \({H_0}\) if \(z \geq z_{\alpha}\).
16.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)
\[Z = \frac {{\overline{x}} - {\mu}}{{\sigma}/{\sqrt{n}}} \quad \iff {\mu} = {\overline{x}} - Z \frac{{\sigma}}{\sqrt{n}} \quad \to {\mu} = {\overline{x}} \pm Z \frac{{\sigma}}{\sqrt{n}} \quad \to {\mu} \approx {\overline{x}} \pm Z \frac{{s}}{\sqrt{n}} \tag{17.2}\]
17.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)
Ex: Golf Company, mean driving distance is 295 yards i.e. \(({\mu}_0 = 295)\)
“ForLater” - This part needs to be moved to the Next Chapter.
Rejection Rule: Reject \({H_0}\) if \(z \leq -z_{\alpha/2}\) or \(z \geq z_{\alpha/2}\)
(Online, might be wrong) Ex: Assume that for a Population with mean \({\mu}\) unknown and standard deviation \({\sigma} = 15\), if we take a sample \({n = 100}\) its sample mean is \({\overline{x}} = 42\).
Assume \({\alpha} = 0.05\) and if we are conducting a Two Tail Test, \(Z_{\alpha/2 = 0.05/2} = 1.960\)
As shown in the equation (17.2), our interval range is \(\mu = \overline{X} \pm 2.94 = 42 \pm 2.94 \rightarrow \mu \in (39.06, 44.94)\)
We are 95% confident that the population mean will be between 39.06 and 44.94
Note that a 95% confidence interval does not mean there is a 95% chance that the true value being estimated is in the calculated interval. Rather, given a population, there is a 95% chance that choosing a random sample from this population results in a confidence interval which contains the true value being estimated.
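The interval above can be reproduced in R with the values from this example (\({\sigma} = 15\), \(n = 100\), \(\overline{x} = 42\)):

```r
# #95% interval estimate, equation (17.2): sigma = 15, n = 100, xbar = 42
xbar <- 42; sigma <- 15; n <- 100; alpha <- 0.05
moe <- qnorm(p = 1 - alpha/2) * sigma / sqrt(n)   # #MOE = 1.96 * 1.5 = 2.94
c(lower = xbar - moe, upper = xbar + moe)         # #(39.06, 44.94)
```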
Common Steps
p-Value Approach Step
Critical Value Approach
Refer equation (16.1). For the \({\sigma}\) known case, the \(100{(1 - \alpha)}\%\) confidence interval estimate of a population mean is given by
\[\overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\]
We know that \(100 {(1 - \alpha)}\%\) of the confidence intervals generated will contain the population mean and \(100 {\alpha}\%\) of the confidence intervals generated will not contain the population mean.
Thus, if we reject \({H_0}\) whenever the confidence interval does not contain \({\mu}_0\), we will be rejecting the null hypothesis when it is true \((\mu = {\mu}_0)\) with probability \({\alpha}\).
The level of significance is the probability of rejecting the null hypothesis when it is true. So constructing a \(100 {(1 - \alpha)}\%\) confidence interval and rejecting \({H_0}\) whenever the interval does not contain \({\mu}_0\) is equivalent to conducting a two-tailed hypothesis test with \({\alpha}\) as the level of significance.
Ex: Golf company
“ForLater” - Exercises
\[t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}} \tag{17.3}\]
One-Tailed Test
Critical Value Approach
- \((\text{DOF = 59}), \, t_{{\alpha} = 0.05} = 1.671\)
- Because \((t = 1.84) > (t_{{\alpha} = 0.05} = 1.671)\), Reject \({H_0}\)
# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
#
# #p-value approach: Find Cumulative Probability P corresponding to the given t-value & DOF=59
pt(q = 1.84, df = 59, lower.tail = FALSE)
## [1] 0.03539999
#
# #Critical Value: t-value for which Area under the curve towards Right is alpha=0.05 & DOF=59
qt(p = 0.05, df = 59, lower.tail = FALSE)
## [1] 1.671093
Two Tailed Test
Critical Value Approach
- \((\text{DOF = 24})\)
- We find that \(P_{\left(t\right)} = 0.025\) for \(-t_{\alpha/2 = 0.025} = -2.064\) and \(t_{\alpha/2 = 0.025} = 2.064\)
- Compare the test statistic with the t-value
- Because \((t = -1.10)\) is NOT lower than \((-t_{\alpha/2 = 0.025} = -2.064)\), we cannot reject \({H_0}\)
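The two-tailed critical value and p-value above can be checked in R (using the stated \(t = -1.10\) and DOF = 24):

```r
# #Two-tailed t test: critical value and p-value for t = -1.10, DOF = 24
t_stat <- -1.10; dof <- 24; alpha <- 0.05
t_crit <- qt(p = alpha/2, df = dof, lower.tail = FALSE)   # #2.064
p_two <- 2 * pt(q = -abs(t_stat), df = dof)               # #Two-tailed p-value
c(t_crit = t_crit, p_two = p_two)
```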
Using \({p}_0\) to denote the hypothesized value for the population proportion, the three forms for a hypothesis test about a population proportion \({p}\) are :
Hypothesis tests about a population proportion are based on the difference between the sample proportion \({\overline{p}}\) and the hypothesized population proportion \({p}_0\)
The sampling distribution of \({\overline{p}}\), the point estimator of the population parameter \({p}\), is the basis for developing the test statistic.
When the null hypothesis is true as an equality, the expected value of \({\overline{p}}\) equals the hypothesized value \({p}_0\) i.e. \(E_{(\overline{p})} = {p}_0\)
The standard error of \({\overline{p}}\) is given in equation (17.4)
\[{\sigma}_{\overline{p}} = \sqrt{\frac{{p}_0 (1 - {p}_0)}{n}} \tag{17.4}\]
If \(np \geq 5\) and \(n(1 − p) \geq 5\), the sampling distribution of \({\overline{p}}\) can be approximated by a normal distribution. Under these conditions, which usually apply in practice, the quantity \({z}\) as given in equation (17.5) has a standard normal probability distribution.
Test Statistic for Hypothesis Tests about a Population Proportion :
\[z = \frac{{\overline{p}} - {p}_0}{{\sigma}_{\overline{p}}} = \frac{{\overline{p}} - {p}_0}{\sqrt{\frac{{p}_0 (1 - {p}_0)}{n}}} \tag{17.5}\]
Example: Pine Creek: Determine whether the proportion of women golfers increased from \(p_0 = 0.20\)
17.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)
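A sketch of this upper tail test in R, written out with the counts the example's figures imply (an assumption here: \(x = 100\) women among \(n = 400\) golfers, giving \(\overline{p} = 0.25\)):

```r
# #Pine Creek upper tail test for a proportion (assumed counts: x = 100, n = 400)
x <- 100; n <- 400; p0 <- 0.20
p_bar <- x / n                        # #Sample proportion: 0.25
se <- sqrt(p0 * (1 - p0) / n)         # #Standard error, equation (17.4)
z <- (p_bar - p0) / se                # #Equation (17.5): z = 2.5
pnorm(q = z, lower.tail = FALSE)      # #Upper tail p-value
```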
{0.25 - 0.20}/{sqrt(0.20 * {1 - 0.20} / 400)}
pnorm(q = 2.50, lower.tail = FALSE)
If the purpose of a hypothesis test is to make a decision when \({H_0}\) is true and a different decision when \({H_a}\) is true, the decision maker may want to, and in some cases be forced to, take action with both the conclusion do not reject \({H_0}\) and the conclusion reject \({H_0}\).
If this situation occurs, statisticians generally recommend controlling the probability of making a Type II error. With the probabilities of both the Type I and Type II error controlled, the conclusion from the hypothesis test is either to accept \({H_0}\) or reject \({H_0}\). In the first case, \({H_0}\) is concluded to be true, while in the second case, \({H_a}\) is concluded true. Thus, a decision and appropriate action can be taken when either conclusion is reached.
“ForLater” - Calculate \({\beta}\)
When the true population mean \({\mu}\) is close to the null hypothesis value of \({\mu} = 120\), the probability is high that we will make a Type II error. However, when the true population mean \({\mu}\) is far below the null hypothesis value of \({\mu} = 120\), the probability is low that we will make a Type II error.
Note that the power curve extends over the values of \({\mu}\) for which the null hypothesis is false. The height of the power curve at any value of \({\mu}\) indicates the probability of correctly rejecting \({H_0}\) when \({H_0}\) is false.
We can make 3 observations about the relationship among \({\alpha}, \beta, n (\text{sample size})\).
We now consider how interval estimates and hypothesis tests can be developed for situations involving two populations, when the difference between the two population means or the two population proportions is of prime importance.
Example
Inferences About the Difference Between Two Population Means
Interval Estimation of \(({\mu}_1 - {\mu}_2)\)
The point estimator of the difference between the two population means \(({\mu}_1 - {\mu}_2)\) is the difference between the two sample means \(({\overline{x}}_1 - {\overline{x}}_2)\). Thus, \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )}\) represents the difference of population means. It is given by equation (18.1)
Point Estimate :
\[E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = {\overline{x}}_1 - {\overline{x}}_2 \tag{18.1}\]
As with other point estimators, the point estimator \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )}\) has a standard error \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}\), that describes the variation in the sampling distribution of the estimator. It is the standard deviation of the sampling distribution of \(({\overline{x}}_1 - {\overline{x}}_2)\). Refer equation (18.2)
Standard Error of \(({\overline{x}}_1 - {\overline{x}}_2)\) :
\[{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{18.2}\]
16.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)
In the case of estimation of the difference between two population means, an interval estimate will take the following form: \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} \, \pm \text{MOE}_{{\gamma}}\). Refer equation (18.3) and (18.4)
Margin of Error (\(\text{MOE}_{{\gamma}}\)) :
\[\text{MOE}_{{\gamma}} = {z}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{18.3}\]
\(\text{Interval Estimate}_{\gamma}\) :
\[\text{Interval Estimate}_{\gamma} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {z}_{\frac{{\alpha}}{2}} \sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{18.4}\]
Example: Greystone: Difference between the mean
18.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
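For the Greystone example, the standard error and margin of error follow equations (18.2) and (18.3); a sketch using the values from the adjacent calculation (\({\sigma}_1 = 9, n_1 = 36, {\sigma}_2 = 10, n_2 = 49\); the sample means are not restated here, so only the MOE is computed):

```r
# #Greystone: MOE for (mu1 - mu2), equations (18.2)-(18.3)
sigma1 <- 9; n1 <- 36; sigma2 <- 10; n2 <- 49; alpha <- 0.05
se <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)            # #Standard error, equation (18.2)
moe <- qnorm(p = alpha/2, lower.tail = FALSE) * se   # #Equation (18.3): about 4.06
moe
```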
qnorm(p = 0.025, lower.tail = FALSE)
sqrt(9^2/36 + 10^2/49)
Using \({D_0}\) to denote the hypothesized difference between \({\mu}_1\) and \({\mu}_2\), the three forms for a hypothesis test are as follows:
The test statistic for the difference between two population means when \({\sigma}_1\) and \({\sigma}_2\) are known is given in equation (18.5)
Test Statistic for Hypothesis Tests :
\[z = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}}} \tag{18.5}\]
Example: Evaluate differences in education quality between two training centers
18.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
sqrt(10^2/30 + 10^2/40)
2 * pnorm(q = 1.66, lower.tail = FALSE)
# #Get P(z) for z = 1.66 (Two-Tail)
#
# #Get the default (lower), subtract from 1, Double if Two-Tail
ii <- 2 * {1 - pnorm(q = 1.66)}
jj <- 2 * {1 - pnorm(q = 1.66, lower.tail = TRUE)}
#
# #Use the symmetry i.e. 'minus z' value, Double if Two-Tail
kk <- 2 * pnorm(q = -1.66)
#
# #Use the actual Upper Tail Option, Double if Two-Tail
ll <- 2 * pnorm(q = 1.66, lower.tail = FALSE)
#
stopifnot(all(identical(round(ii, 7), round(jj, 7)), identical(round(ii, 7), round(kk, 7)),
identical(round(ii, 7), round(ll, 7))))
ll
## [1] 0.09691445
The null hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level, the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.
On the other hand, if the p-value is greater than the chosen alpha level, the null hypothesis (that the data came from a normally distributed population) cannot be rejected (e.g., for an alpha level of .05, a data set with a p-value of less than .05 rejects the null hypothesis that the data are from a normally distributed population).
18.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
sqrt(2.2^2/50 + 3^2/35)
18.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
sqrt(5.2^2/40 + 6^2/50)
pnorm(q = 2.03, lower.tail = FALSE)
18.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
sqrt(4.55^2/37 + 3.97^2/44)
qnorm(p = 0.025, lower.tail = FALSE)
pnorm(q = 4.13369, lower.tail = FALSE)
18.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
sqrt(6^2/60 + 6^2/60)
pnorm(q = -2.738613, lower.tail = TRUE)
18.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
pnorm(q = 0.2739, lower.tail = FALSE)
18.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
pnorm(q = -1.826, lower.tail = TRUE)
qnorm(p = 0.05, lower.tail = TRUE)
Use the sample standard deviations, \({s}_1\) and \({s}_2\), to estimate the unknown population standard deviations \(({\sigma}_1, {\sigma}_2)\).
When \(({\sigma}_1, {\sigma}_2)\) are estimated by \(({s}_1, {s}_2)\), the t distribution is used to make inferences about the difference between two population means.
Interval Estimation of \(({\mu}_1 - {\mu}_2)\)
Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer (18.6) like (18.3)
\[\text{MOE}_{{\gamma}} = {t}_{\frac{{\alpha}}{2}}{s}_{({\overline{x}}_1 - {\overline{x}}_2)} = {t}_{\frac{{\alpha}}{2}}\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}} \tag{18.6}\]
\(\text{Interval Estimate}_{\gamma}\) : Refer (18.7) like (18.4) using (18.2)
\[\text{Interval Estimate}_{\gamma} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {t}_{\frac{{\alpha}}{2}}{s}_{({\overline{x}}_1 - {\overline{x}}_2)} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {t}_{\frac{{\alpha}}{2}}\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}} \tag{18.7}\]
Degrees of Freedom (DOF) : Refer (18.8)
\[\text{DOF} = \frac{ { \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right) }^2} {\frac{1}{n_1 - 1}{ \left( \frac{s_1^2}{n_1} \right) }^2 + \frac{1}{n_2 - 1}{ \left( \frac{s_2^2}{n_2} \right) }^2} \tag{18.8}\]
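Equation (18.8) is tedious by hand, so a small helper is handy (the function name `dofWelch` is ours, not from any package):

```r
# #Welch-Satterthwaite degrees of freedom, equation (18.8); a helper sketch
dofWelch <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}
floor(dofWelch(s1 = 150, n1 = 28, s2 = 125, n2 = 22))   # #Clearwater example: 47
```

Rounding the DOF down (floor) gives a slightly larger t-value, i.e. a conservative interval.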
Example: Clearwater - To estimate the difference between the mean
18.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
sqrt(150^2/28 + 125^2/22)
floor({150^2 / 28 + 125^2 / 22 }^2 / {{150^2 / 28}^2/{28-1} + {125^2 / 22}^2/{22-1}})
qt(p = 0.025, df = 47, lower.tail = FALSE)
Hypothesis Tests
Test Statistic for Hypothesis Tests : Refer (18.9) like (18.5) using (18.2)
\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{s}_{({\overline{x}}_1 - {\overline{x}}_2)}} = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}}} \tag{18.9}\]
Here \({s}_{({\overline{x}}_1 - {\overline{x}}_2)}\) denotes the estimated standard error, obtained by substituting \({s}_1, {s}_2\) for \({\sigma}_1, {\sigma}_2\) in (18.2); the \({\sigma}\) symbol is not used because the formula contains \({s}_1, {s}_2\).
Software: To show that the new software will provide a shorter mean time
18.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
sqrt(40^2/12 + 44^2/12)
floor({40^2 / 12 + 44^2 / 12 }^2 / {{40^2 / 12}^2/{12-1} + {44^2 / 12}^2/{12-1}})
pt(q = 2.272, df = 21, lower.tail = FALSE)
CONVERT HERE to Tilde Based Option, to take advantage of Column Headers
# #Software
xxSoftware <- tibble(Old = c(300, 280, 344, 385, 372, 360, 288, 321, 376, 290, 301, 283),
New = c(274, 220, 308, 336, 198, 300, 315, 258, 318, 310, 332, 263))
aa <- xxSoftware
# Summary
aa %>%
pivot_longer(everything(), names_to = "key", values_to = "value") %>%
group_by(key) %>%
summarise(across(value, list(Count = length, Mean = mean, SD = sd), .names = "{.fn}"))
## # A tibble: 2 x 4
## key Count Mean SD
## <chr> <int> <dbl> <dbl>
## 1 New 12 286 44.0
## 2 Old 12 325 40.0
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
#
# #Welch Two Sample t-test
# #Alternative must be: "two.sided" (Default), "less", "greater"
# #NOTE: the formula interface orders groups alphabetically (New, Old), so t.test()
# #tests (New - Old). To test Ha: mu_Old - mu_New > 0 here, alternative = "less" is
# #needed (or use the commented x/y call below with "greater"); as written, the
# #direction is reversed, which is why this run fails to reject H0.
bb_ha <- "greater"
#bb_testT <- t.test(x = bb$Old, y = bb$New, alternative = bb_ha)
bb_testT <- t.test(formula = value ~ key, data = bb, alternative = bb_ha)
bb_testT
##
## Welch Two Sample t-test
##
## data: value by key
## t = -2.2721, df = 21.803, p-value = 0.9833
## alternative hypothesis: true difference in means between group New and group Old is greater than 0
## 95 percent confidence interval:
## -68.48569 Inf
## sample estimates:
## mean in group New mean in group Old
## 286 325
#
cat(paste0("t is the t-test statistic value (t = ", round(bb_testT$statistic, 6), ")\n"))
## t is the t-test statistic value (t = -2.272127)
cat(paste0("df is the degrees of freedom (df = ", round(bb_testT$parameter, 1), ")\n"))
## df is the degrees of freedom (df = 21.8)
cat(paste0("p-value is the significance level of the t-test (p-value = ",
round(bb_testT$p.value, 6), ")\n"))
## p-value is the significance level of the t-test (p-value = 0.98335)
cat(paste0("conf.int is the confidence interval of the mean at 95% (conf.int = [",
paste0(round(bb_testT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval of the mean at 95% (conf.int = [-68.486, Inf])
cat(paste0("sample estimates is the mean value of the samples. Mean: ",
paste0(round(bb_testT$estimate, 2), collapse = ", "), "\n"))
## sample estimates is the mean value of the samples. Mean: 286, 325
#
# #Compare p-value with alpha = 0.05
alpha <- 0.05
# #R already reports the two-sided p-value doubled, so compare with alpha directly
if(bb_testT$p.value >= alpha) {
cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is greater than alpha (", alpha,
"). We failed to reject H0. We cannot conclude that the populations are different.\n"))
} else {
cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is less than alpha (", alpha,
").\nWe can reject the H0 with 95% confidence. The populations are different.\n"))
}
## p-value (0.98335) is greater than alpha (0.05). We failed to reject H0. We cannot conclude that the populations are different.
Another approach used to make inferences about the difference between two population means when \({\sigma}_1\) and \({\sigma}_2\) are unknown is based on the assumption that the two population standard deviations are equal \(({\sigma}_1 = {\sigma}_2 = {\sigma})\). Under this assumption, the two sample standard deviations are combined to provide the pooled sample variance as given in equation (18.10).
\[{s}_p^2 = \frac{({n}_1 - 1){s}_1^2 + ({n}_2 - 1){s}_2^2}{{n}_1 + {n}_2 - 2} \tag{18.10}\]
The t test statistic becomes (18.11) with \(({n}_1 + {n}_2 - 2)\) degrees of freedom.
\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{s}_p\sqrt{\frac{1}{{n}_1} + \frac{1}{{n}_2}}} \tag{18.11}\]
Then the computation of the p-value and the interpretation of the sample results are the same as earlier.
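Equations (18.10) and (18.11) as a small sketch; the sample values are assumed here for illustration, chosen to match the training-center example (\(\overline{x}_1 = 82, \overline{x}_2 = 78, s_1 = s_2 = 10, n_1 = 30, n_2 = 40\)):

```r
# #Pooled sample variance and t statistic, equations (18.10)-(18.11)
# #(illustrative values assumed: xbar1 = 82, xbar2 = 78, s1 = s2 = 10, n1 = 30, n2 = 40)
xbar1 <- 82; xbar2 <- 78; s1 <- 10; s2 <- 10; n1 <- 30; n2 <- 40; D0 <- 0
sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)       # #Equation (18.10)
t_stat <- (xbar1 - xbar2 - D0) / sqrt(sp2 * (1 / n1 + 1 / n2))   # #Equation (18.11)
c(sp2 = sp2, t = t_stat, df = n1 + n2 - 2)
```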
(External) Unpaired Two-Samples T-test
# #Pooled Sample Variance
# #Evaluate differences in education quality between two training centers
# #Generate Data for Two Centers
set.seed(3)
setA <- rnorm(n = 30, mean = 82, sd = 10)
setB <- rnorm(n = 40, mean = 78, sd = 10)
bb <- tibble(sets = c(rep("setA", length(setA)), rep("setB", length(setB))), values = c(setA, setB))
# Summary
bb %>% group_by(sets) %>% summarise(Count = n(), Mean = mean(values), SD = sd(values))
## # A tibble: 2 x 4
## sets Count Mean SD
## <chr> <int> <dbl> <dbl>
## 1 setA 30 79.7 8.10
## 2 setB 40 78.8 9.43
#
# #Assumption 1: Are the two samples independent?
# #YES
#
# #Assumption 2: Do the data from each of the 2 groups follow a normal distribution?
# #Shapiro-Wilk normality test
isNormal_A <- with(bb, shapiro.test(values[sets == "setA"]))
isNormal_B <- with(bb, shapiro.test(values[sets == "setB"]))
#isNormal_A
#isNormal_A$statistic
# #p-value > 0.05 is needed for Normality
isNormal_A$p.value
## [1] 0.1032747
isNormal_B$p.value
## [1] 0.15595
#
# #Both have p-values greater than the significance level alpha = 0.05.
# #implying that the distribution of the data is not significantly different from the normal distribution.
# #In other words, we can assume normality.
#
bb %>%
group_by(sets) %>%
summarise(p = shapiro.test(values)$p.value)
## # A tibble: 2 x 2
## sets p
## <chr> <dbl>
## 1 setA 0.103
## 2 setB 0.156
#
# #Assumption 3. Do the two populations have the same variances
# #We will use F-test to test for homogeneity in variances. (p-value > 0.05 is needed)
#
bb_testF <- var.test(values ~ sets, data = bb)
# bb_testF
bb_testF$p.value
## [1] 0.3990791
#
# #The p-value of F-test is greater than the significance level alpha = 0.05.
# #In conclusion, there is no significant difference between the variances of the two sets of data.
# #Therefore, we can use the classic t-test, which assumes equality of the two variances.
#
# #Compute unpaired two-samples t-test
#
# #Question : Is there any significant difference between the means of the two populations?
#
bb_testT <- t.test(formula = values ~ sets, data = bb, var.equal = TRUE)
bb_testT
##
## Two Sample t-test
##
## data: values by sets
## t = 0.41947, df = 68, p-value = 0.6762
## alternative hypothesis: true difference in means between group setA and group setB is not equal to 0
## 95 percent confidence interval:
## -3.383982 5.185373
## sample estimates:
## mean in group setA mean in group setB
## 79.66783 78.76714
#
cat(paste0("t is the t-test statistic value (t = ", round(bb_testT$statistic, 6), ")\n"))
## t is the t-test statistic value (t = 0.419474)
cat(paste0("df (n1 + n2 - 2) is the degrees of freedom (df = ", bb_testT$parameter, ")\n"))
## df (n1 + n2 - 2) is the degrees of freedom (df = 68)
cat(paste0("p-value is the significance level of the t-test (p-value = ",
round(bb_testT$p.value, 6), ")\n"))
## p-value is the significance level of the t-test (p-value = 0.676192)
cat(paste0("conf.int is the confidence interval of the mean at 95% (conf.int = [",
paste0(round(bb_testT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval of the mean at 95% (conf.int = [-3.384, 5.185])
cat(paste0("sample estimates is the mean value of the samples. Mean: ",
paste0(round(bb_testT$estimate, 2), collapse = ", "), "\n"))
## sample estimates is the mean value of the samples. Mean: 79.67, 78.77
#
# #Compare p-value with alpha = 0.05
alpha <- 0.05
if(bb_testT$p.value >= alpha) {
cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is greater than alpha (", alpha,
"). We failed to reject H0. We cannot conclude that the populations are different.\n"))
} else {
cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is less than alpha (", alpha,
")\n. We can reject the H0 with 95% confidence. The populations are different.\n"))
}
## p-value (0.676192) is greater than alpha (0.05). We failed to reject H0. We cannot conclude that the populations are different.
bb <- xxSoftware
str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
## $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
## $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
#
# #Applying Multiple Functions with Summarise but Output as Cross-Table
# #(Original) Columns as Rows, Functions as Columns
# #NOTE: n() can be applied as lambda function
bb %>%
pivot_longer(everything(), names_to = "key", values_to = "value") %>%
group_by(key) %>%
summarise(across(value, list(N = ~n(), Count = length, Mean = mean, SD = sd), .names = "{.fn}"))
## # A tibble: 2 x 5
## key N Count Mean SD
## <chr> <int> <int> <dbl> <dbl>
## 1 New 12 12 286 44.0
## 2 Old 12 12 325 40.0
str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
## $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
## $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
#
# #gather() is deprecated; kept here for reference.
# #Longer Tibble is filled with All Values of Col A, then All Values of Col B and so on
ii <- gather(bb)
jj <- bb %>% gather("key", "value")
kk <- bb %>% gather("key", "value", everything())
#
# #pivot_longer()
# #Longer Tibble is filled with First Row of All Columns, then 2nd Row of All Columns and so on
ll <- bb %>% pivot_longer(everything(), names_to = "key", values_to = "value") %>% arrange(desc(key))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll)))
str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
## $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
## $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
# #Store a Grouped Tibble
ii <- bb %>%
pivot_longer(everything(), names_to = "key", values_to = "value") %>%
group_by(key)
str(ii)
## grouped_df [24 x 2] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ key : chr [1:24] "Old" "New" "Old" "New" ...
## $ value: num [1:24] 300 274 280 220 344 308 385 336 372 198 ...
## - attr(*, "groups")= tibble [2 x 2] (S3: tbl_df/tbl/data.frame)
## ..$ key : chr [1:2] "New" "Old"
## ..$ .rows: list<int> [1:2]
## .. ..$ : int [1:12] 2 4 6 8 10 12 14 16 18 20 ...
## .. ..$ : int [1:12] 1 3 5 7 9 11 13 15 17 19 ...
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
ii %>% summarise(across(value, list(N = ~n(), Count = length, Mean = mean, SD = sd),
.names = "{.fn}"))
## # A tibble: 2 x 5
## key N Count Mean SD
## <chr> <int> <int> <dbl> <dbl>
## 1 New 12 12 286 44.0
## 2 Old 12 12 325 40.0
#
# #Equivalent (except Column Headers)
ii %>% summarise(N = n(), Count = across(value, length),
Mean = across(value, mean), SD = across(value, sd))
## # A tibble: 2 x 5
## key N Count$value Mean$value SD$value
## <chr> <int> <int> <dbl> <dbl>
## 1 New 12 12 286 44.0
## 2 Old 12 12 325 40.0
“ForLater”
sqrt(2.5^2/20 + 4.8^2/30)
floor({2.5^2 / 20 + 4.8^2 / 30 }^2 / {{2.5^2 / 20}^2/{20-1} + {4.8^2 / 30}^2/{30-1}})
qt(p = 0.025, df = 45, lower.tail = FALSE)
18.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
sqrt(5.2^2/35 + 8.5^2/40)
floor({5.2^2 / 35 + 8.5^2 / 40 }^2 / {{5.2^2 / 35}^2/{35-1} + {8.5^2 / 40}^2/{40-1}})
pt(q = 2.179, df = 65, lower.tail = FALSE)
Suppose employees at a manufacturing company can use two different methods to perform a production task. To maximize production output, the company wants to identify the method with the smaller population mean completion time. We can use two alternative designs for the sampling procedure.
In the matched sample design the two production methods are tested under similar conditions (i.e., with the same workers); hence this design often leads to a smaller sampling error than the independent sample design.
The primary reason is that in a matched sample design, variation between workers is eliminated because the same workers are used for both production methods.
The key to the analysis of the matched sample design is to realize that we consider only the column of differences.
The matched sample design is generally preferred to the independent sample design because the matched-sample procedure often improves the precision of the estimate.
Let \({\mu}_d\) = the mean of the difference in values for the population
Sample Mean is given by (18.12) like (11.6) and Sample Standard Deviation is given by (18.13) like (11.12)
\[\overline{d} = \frac{\sum{{d}_i}}{n} \tag{18.12}\]
\[{s}_d = \sqrt{\frac{\sum ({d}_i - \overline{d})^2}{n-1}} \tag{18.13}\]
Test Statistic with \((n-1)\) degrees of freedom : Refer (18.14) like (17.3)
\[t = \frac{\overline{d} - {\mu}_d}{{s}_d/\sqrt{n}} \tag{18.14}\]
Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer (18.15) like (18.3)
\[\text{MOE}_{{\gamma}} = {t}_{\frac{{\alpha}}{2}}\frac{{s}_d}{\sqrt{n}} \tag{18.15}\]
\(\text{Interval Estimate}_{\gamma}\) : Refer (18.16) like (18.4)
\[\text{Interval Estimate}_{\gamma} = \overline{d} \pm {t}_{\frac{{\alpha}}{2}}\frac{{s}_d}{\sqrt{n}} \tag{18.16}\]
“ForLater” - Exercise
Example: Two Production Methods
{mean(bb$d) - 0 } / {sd(bb$d) / sqrt(length(bb$d))}
2 * pt(q = 2.196, df = 5, lower.tail = FALSE)
qt(p = 0.025, df = 5, lower.tail = FALSE)
# #Matched Samples: Same workers providing data for two methods
xxMatchedMethods <- tibble(M1 = c(6, 5, 7, 6.2, 6, 6.4),
M2 = c(5.4, 5.2, 6.5, 5.9, 6, 5.8))
aa <- xxMatchedMethods
# #Get Difference
bb <- aa %>% mutate(d = M1-M2)
str(bb)
## tibble [6 x 3] (S3: tbl_df/tbl/data.frame)
## $ M1: num [1:6] 6 5 7 6.2 6 6.4
## $ M2: num [1:6] 5.4 5.2 6.5 5.9 6 5.8
## $ d : num [1:6] 0.6 -0.2 0.5 0.3 0 ...
paste0(round(bb[3], 1))
## [1] "c(0.6, -0.2, 0.5, 0.3, 0, 0.6)"
#
cat(paste0("- ${n} = ", length(bb$d), ", {\\overline{d}} = ", round(mean(bb$d), 1),
", {{s}_d} = ", round(sd(bb$d), 3), "$\n"))
## - ${n} = 6, {\overline{d}} = 0.3, {{s}_d} = 0.335$
cat(paste0("t = ", round({mean(bb$d) - 0 } / {sd(bb$d) / sqrt(length(bb$d))}, 3), "\n"))
## t = 2.196
#
# #Paired t-test
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
#
# #Paired t-test via t.test() with paired = TRUE (a Welch test would apply only to independent samples)
# #Alternative must be: "two.sided" (Default), "less", "greater"
bb_ha <- "two.sided"
bb_testT <- t.test(formula = value ~ key, data = bb, alternative = bb_ha, paired = TRUE)
bb_testT
##
## Paired t-test
##
## data: value by key
## t = 2.1958, df = 5, p-value = 0.07952
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.05120834 0.65120834
## sample estimates:
## mean of the differences
## 0.3
To make an inference about the difference between the two population proportions \(({p}_1 - {p}_2)\), we select a simple random sample of \({n}_1\) units from population 1 and a second simple random sample of \({n}_2\) units from population 2. Let \({\overline{p}}_1, {\overline{p}}_2\) denote the sample proportions for the simple random samples from populations 1 and 2.
\(({x})\) denotes Count of Success
Interval Estimation of \(({p}_1 - {p}_2)\)
The point estimator of the difference between two population proportions is the difference between the sample proportions of two independent simple random samples.
Point Estimate : Refer (18.17) like (18.1)
\[E_{( {\overline{p}}_1 - {\overline{p}}_2 )} = {\overline{p}}_1 - {\overline{p}}_2 \tag{18.17}\]
As with other point estimators, the point estimator \(E_{( {\overline{p}}_1 - {\overline{p}}_2 )}\) has a standard error \({\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)}\), that describes the variation in the sampling distribution of the estimator.
The two population proportions, \(({p}_1, {p}_2)\), are unknown. Thus, sample proportions \(({\overline{p}}_1, {\overline{p}}_2)\) are being used to estimate them.
Standard Error of \(({\overline{p}}_1 - {\overline{p}}_2)\) : Refer equation (18.18) like (18.2)
\[\begin{align} {\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} &= \sqrt{\frac{{p}_1 (1-{p}_1)}{{n}_1} + \frac{{p}_2 (1-{p}_2)}{{n}_2}} \\ &= \sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \end{align} \tag{18.18}\]
Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer equation (18.19) like (18.3)
\[\text{MOE}_{{\gamma}} = {z}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} = {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \tag{18.19}\]
\(\text{Interval Estimate}_{\gamma}\) : Refer equation (18.20) like (18.4)
\[\text{Interval Estimate}_{\gamma} = ({\overline{p}}_1 - {\overline{p}}_2) \pm {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \tag{18.20}\]
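Equation (18.20) can be sketched directly in R. The figures below are the Tax Preparation values used in the example that follows (\({\overline{p}}_1 = 0.14, {n}_1 = 250; {\overline{p}}_2 = 0.09, {n}_2 = 300\), \({\gamma} = 0.90\)):

```r
# #Interval estimate (18.20) for (p1 - p2); figures from the Tax Preparation example
p1bar <- 0.14; n1 <- 250   # returns with errors: 35 of 250
p2bar <- 0.09; n2 <- 300   # returns with errors: 27 of 300
gamma <- 0.90
alpha <- 1 - gamma

se  <- sqrt(p1bar * (1 - p1bar) / n1 + p2bar * (1 - p2bar) / n2)  # (18.18)
moe <- qnorm(p = alpha / 2, lower.tail = FALSE) * se              # (18.19)
ci  <- (p1bar - p2bar) + c(-1, 1) * moe                           # (18.20)
round(ci, 4)
```

The result agrees with the 90 percent confidence interval reported by prop.test() later in this section.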
Example: Tax Preparation: (Count of Success \(({x})\) is Number of Returns with Errors)
qnorm(p = 0.05, lower.tail = FALSE)
sqrt(0.14 * (1 - 0.14) / 250 + 0.09 * (1 - 0.09) / 300)
When we assume \({H_0}\) is true as an equality, we have \({p}_1 - {p}_2 = 0\), which is the same as saying that the population proportions are equal, \({p}_1 = {p}_2 = {p}\). The equation (18.18) becomes (18.21)
Standard Error of \(({\overline{p}}_1 - {\overline{p}}_2)\) : When \({p}_1 = {p}_2 = {p}\)
\[\begin{align} {\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} &= \sqrt{{p} (1-{p})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)} \\ &= \sqrt{{\overline{p}} (1-{\overline{p}})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)} \end{align} \tag{18.21}\]
With \({p}\) unknown, we pool, or combine, the point estimators from the two samples, \(({\overline{p}}_1, {\overline{p}}_2)\), to obtain a single point estimator of \({p}\), given by (18.22)
Pooled Estimator of \({p}\) : When \({p}_1 = {p}_2 = {p}\)
\[{\overline{p}} = \frac{{n}_1 {\overline{p}}_1 + {n}_2 {\overline{p}}_2}{{n}_1 + {n}_2} \tag{18.22}\]
Test Statistic for Hypothesis Tests :
\[z = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{{\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)}} = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{\sqrt{{\overline{p}} (1-{\overline{p}})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)}} \tag{18.23}\]
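Equations (18.21)-(18.23) can likewise be sketched with the same Tax Preparation figures:

```r
# #Pooled estimator (18.22), standard error (18.21), and z statistic (18.23)
p1bar <- 0.14; n1 <- 250
p2bar <- 0.09; n2 <- 300

pbar <- (n1 * p1bar + n2 * p2bar) / (n1 + n2)        # (18.22) pooled estimator
se   <- sqrt(pbar * (1 - pbar) * (1 / n1 + 1 / n2))  # (18.21)
z    <- (p1bar - p2bar) / se                         # (18.23) test statistic
pval <- 2 * pnorm(q = z, lower.tail = FALSE)         # two-tail p-value
round(c(pooled = pbar, z = z, p.value = pval), 4)
```

Note that the X-squared value reported by prop.test() below is the square of this z statistic.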
Example: Tax Preparation: Continuation
18.11 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {p}_1 - {p}_2 = 0 \iff {H_a}: {p}_1 - {p}_2 \neq 0\)
{250 * 0.14 + 300 * 0.09} / {250 + 300}
2 * pnorm(q = 1.845, lower.tail = FALSE)
(External) Two-Proportions Z-Test
bb_ha <- "two.sided"
bb_gamma <- 0.90
bb_propT <- prop.test(x = c(35, 27), n = c(250, 300), alternative = bb_ha,
conf.level = bb_gamma, correct = FALSE)
bb_propT
##
## 2-sample test for equality of proportions without continuity correction
##
## data: c(35, 27) out of c(250, 300)
## X-squared = 3.4084, df = 1, p-value = 0.06486
## alternative hypothesis: two.sided
## 90 percent confidence interval:
## 0.004815898 0.095184102
## sample estimates:
## prop 1 prop 2
## 0.14 0.09
names(bb_propT)
## [1] "statistic" "parameter" "p.value" "estimate" "null.value" "conf.int"
## [7] "alternative" "method" "data.name"
#
# #X-squared is the square of the calculated z-value (1.8461^2 = 3.4084)
cat(paste0("X-squared is the square of the z test statistic (X-squared = ", round(bb_propT$statistic, 6), ")\n"))
## X-squared is the square of the z test statistic (X-squared = 3.408415)
#
# #By default, the function prop.test() uses the Yates continuity correction
# #It is important if either the expected successes or failures is < 5.
# #If you do not want the correction, use the additional argument correct = FALSE.
# #i.e. To make the test mathematically equivalent to the uncorrected z-test of a proportion.
cat(paste0("p-value is the observed significance level of the test (p-value = ",
round(bb_propT$p.value, 6), ")\n"))
## p-value is the observed significance level of the test (p-value = 0.064865)
cat(paste0("conf.int is the confidence interval for the difference in proportions at the ", bb_gamma,
" level (conf.int = [", paste0(round(bb_propT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval for the difference in proportions at the 0.9 level (conf.int = [0.005, 0.095])
cat(paste0("sample estimates are the estimated probabilities of success. p: ",
paste0(round(bb_propT$estimate, 2), collapse = ", "), "\n"))
## sample estimates are the estimated probabilities of success. p: 0.14, 0.09
In many manufacturing applications, controlling the process variance is extremely important in maintaining quality.
The sample variance \({s^2}\), given by equation (19.1), is the point estimator of the population variance \({\sigma}^2\).
\[{s^2} = \frac{\sum {({x}_i - {\overline{x}})}^2}{n-1} \tag{19.1}\]
Note:
Example: A sample of 20 containers \({n = 20}\) has the sample variance \({s^2} = 0.0025\)
qchisq(p = 0.025, df = 19, lower.tail = FALSE)
qchisq(p = 0.975, df = 19, lower.tail = FALSE)
# #pnorm() qnorm() | pt() qt() | pchisq() qchisq() | pf() qf()
#
# #p-value approach: Find Cumulative Probability P corresponding to the given ChiSq & DOF=19
pchisq(q = 32.852, df = 19, lower.tail = FALSE)
## [1] 0.02500216
#
# #ChiSq value for which Area under the curve towards Right is alpha=0.025 & DOF=19 #32.852
qchisq(p = 0.025, df = 19, lower.tail = FALSE)
## [1] 32.85233
\[\frac{(n-1)s^2}{{\chi_{0.025}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{0.975}^2}} \tag{19.2}\]
Generalising the equation (19.2), the equation (19.3) is the interval estimate of a population variance.
\[\frac{(n-1)s^2}{{\chi_{{\alpha}/2}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{1-{\alpha}/2}^2}} \tag{19.3}\]
where the \({\chi^2}\) values are based on a chi-square distribution with \({n-1}\) degrees of freedom and where \((1 − {\alpha})\) is the confidence coefficient.
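A sketch of equation (19.3) for the container example above (\({n = 20}\), \({s^2} = 0.0025\), 95% confidence):

```r
# #Interval estimate (19.3) of sigma^2 for the container example
n  <- 20
s2 <- 0.0025
alpha <- 0.05

chi_hi <- qchisq(p = alpha / 2, df = n - 1, lower.tail = FALSE)      # chi^2_{alpha/2}
chi_lo <- qchisq(p = 1 - alpha / 2, df = n - 1, lower.tail = FALSE)  # chi^2_{1-alpha/2}
ci <- c((n - 1) * s2 / chi_hi, (n - 1) * s2 / chi_lo)
round(ci, 5)
```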
Using \({{\sigma}_0^2}\) to denote the hypothesized value for the population variance, the three forms for a hypothesis test are as follows:
Note: In general, Upper Tail test is the most frequently observed test because low variances are generally desirable. With a statement about the maximum allowable population variance, we can test the null hypothesis that the population variance is less than or equal to the maximum allowable value against the alternative hypothesis that the population variance is greater than the maximum allowable value. With this test structure, corrective action will be taken whenever rejection of the null hypothesis indicates the presence of an excessive population variance.
Test Statistic for Hypothesis Tests About a Population Variance: Refer (19.4), where \({\chi^2}\) has a chi-square distribution with \({n - 1}\) degrees of freedom.
\[{\chi^2} = \frac{(n - 1){s}^2}{{\sigma}_0^2} \tag{19.4}\]
Example: Louis: the company standard specifies an arrival time variance of 4 or less
19.3 \(\text{\{Right or Upper\} } {H_0} : {\sigma}^2 \leq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 > {{\sigma}_0^2}\)
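A sketch of this upper-tail test; \({n = 24}\) and \({s^2 = 4.9}\) are assumed here (this summary omits them) and are consistent with the chi-square value 28.18 used in the example:

```r
# #Test statistic (19.4) and p-value for H0: sigma^2 <= 4
# #n = 24 and s2 = 4.9 are assumed values consistent with chi-square = 28.18
n  <- 24
s2 <- 4.9
sigma0sq <- 4

chisq_stat <- (n - 1) * s2 / sigma0sq                          # (19.4)
pval <- pchisq(q = chisq_stat, df = n - 1, lower.tail = FALSE) # upper-tail p-value
round(c(chisq = chisq_stat, p.value = pval), 4)
```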
pchisq(q = 28.18, df = 23, lower.tail = FALSE)
Example: Bureau of Motor Vehicles: Evaluate the variance in the new examination test scores with the historical value \({\sigma}_0^2 = 100\)
19.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}^2 = {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 \neq {{\sigma}_0^2}\)
2 * pchisq(q = 46.98, df = 29, lower.tail = FALSE)
The two sample variances \({s}_1^2\) and \({s}_2^2\) will be the basis for making inferences about the two population variances \({\sigma}_1^2\) and \({\sigma}_2^2\).
Note:
Test Statistic for Hypothesis Tests About Population Variances with \(({\sigma}_1^2 = {\sigma}_2^2)\) : Refer equation (19.5)
\[F = \frac{{s}_1^2}{{s}_2^2} \tag{19.5}\]
Hypothesis Tests :
Example: Dullus County Schools:
19.8 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}_1^2 = {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 \neq {\sigma}_2^2\)
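A sketch of this two-tail test; \({n}_1 = 26\) and \({n}_2 = 16\) are assumed from the degrees of freedom in the pf() call below, and \({F = 2.4}\) is the observed variance ratio:

```r
# #Two-tail F test of H0: sigma1^2 = sigma2^2, using (19.5)
# #n1 = 26 and n2 = 16 are assumed from df1 = 25 and df2 = 15
n1 <- 26; n2 <- 16
Fstat <- 2.4                                  # (19.5) F = s1^2 / s2^2
pval  <- 2 * pf(q = Fstat, df1 = n1 - 1, df2 = n2 - 1, lower.tail = FALSE)
round(pval, 4)
```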
2 * pf(q = 2.4, df1 = 25, df2 = 15, lower.tail = FALSE)
Example: Public opinion survey: Do women show a greater variation in attitude on political issues than men?
19.7 \(\text{\{Right or Upper\} } {H_0} : {\sigma}_1^2 \leq {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 > {\sigma}_2^2\)
pf(q = 1.5, df1 = 40, df2 = 30, lower.tail = FALSE)
This chapter introduces hypothesis-testing procedures that expand our capacity for making statistical inferences about populations.
where \({p}_j\) is the population proportion of the \(j^{\text{th}}\) population. We assume that a simple random sample of size \({n}_j\) has been selected from each of the \({k}\) populations or treatments.
Select a random sample from each of the populations and record the observed frequencies, \(f_{ij}\), in a table with 2 rows and k columns.
Expected Frequencies Under the Assumption \({H_0}\) is true : Refer equation (20.1)
\[e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Total Sample Size}} \tag{20.1}\]
Chi-Square Test Statistic : Refer equation (20.2)
\[{\chi^2} = \sum_{i}{\sum_{j}{\frac{(f_{ij} - e_{ij})^2}{e_{ij}}}} \tag{20.2}\]
Where:
\[\begin{align} f_{ij} &= \text{observed frequency for the cell in row } i \text{ and column } j \\ e_{ij} &= \text{expected frequency for the cell in row } i \text{ and column } j \end{align}\]
Note: In a chi-square test involving the equality of \({k}\) population proportions, the above test statistic has a chi-square distribution with \({k - 1}\) degrees of freedom provided the expected frequency is 5 or more for each cell.
A chi-square test for equal population proportions will always be an upper tail test with rejection of \({H_0}\) occurring when the test statistic is in the upper tail of the chi-square distribution.
In studies such as these, we often use the same sample size for each population. We have chosen different sample sizes in this example to show that the chi-square test is not restricted to equal sample sizes for each of the k populations.
“ForLater” - Creating the ChiSq Table
Example: JD Power: Compare customer loyalty for three automobiles by using the proportion of owners likely to repurchase a particular automobile
pchisq(q = 7.89, df = 2, lower.tail = FALSE)
“ForLater” - Marascuilo procedure
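The same test can be run end-to-end with chisq.test(). The counts below are assumed (this summary omits them) but reproduce the chi-square value 7.89 used above:

```r
# #Test of equal population proportions, equations (20.1)-(20.2)
# #Assumed counts: 125, 200, 175 owners sampled; 69, 120, 123 likely to repurchase
yes <- c(69, 120, 123)
n   <- c(125, 200, 175)
tab <- rbind(Likely = yes, NotLikely = n - yes)

bb_chisq <- chisq.test(tab)   # expected counts e_ij computed as in (20.1)
bb_chisq$expected
round(c(statistic = unname(bb_chisq$statistic), df = unname(bb_chisq$parameter),
        p.value = unname(bb_chisq$p.value)), 4)
```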
An important application of a chi-square test involves using sample data to test for the independence of two categorical variables. The null hypothesis for this test is that the two categorical variables are independent. Thus, the test is referred to as a test of independence.
Example: Beer: Preference vs. gender
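A sketch of such a test of independence; the counts below are illustrative only, since the example's data are not reproduced in this summary:

```r
# #Chi-square test of independence: gender vs. beer preference
# #Counts are illustrative, not taken from the source example
xxBeer <- rbind(Male   = c(Light = 51, Regular = 56, Dark = 25),
                Female = c(Light = 39, Regular = 21, Dark = 8))
bb_chisq <- chisq.test(xxBeer)   # expected counts under independence, as in (20.1)
bb_chisq
```

Rejection of \({H_0}\) here would indicate an association between gender and preference.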
All tests apply to categorical variables and all tests use a chi-square \({\chi^2}\) test statistic that is based on the differences between observed frequencies and expected frequencies. In each case, expected frequencies are computed under the assumption that the null hypothesis is true. These chi-square tests are upper tailed tests. Large differences between observed and expected frequencies provide a large value for the chi-square test statistic and indicate that the null hypothesis should be rejected.
The test for the equality of population proportions for three or more populations is based on independent random samples selected from each of the populations. The sample data show the counts for each of two categorical responses for each population. The null hypothesis is that the population proportions are equal. Rejection of the null hypothesis supports the conclusion that the population proportions are not all equal.
The test of independence between two categorical variables uses one sample from a population with the data showing the counts for each combination of two categorical variables. The null hypothesis is that the two variables are independent and the test is referred to as a test of independence. If the null hypothesis is rejected, there is statistical evidence of an association or dependency between the two variables.
The goodness of fit test is used to test the hypothesis that a population has a specific historical or theoretical probability distribution. We showed applications for populations with a multinomial probability distribution and with a normal probability distribution. Since the normal probability distribution applies to continuous data, intervals of data values were established to create the categories for the categorical variable required for the goodness of fit test.
Example: Chemitech: Comparison of three methods of assembly A, B, C in terms of most assemblies per week
xxChemitech <- tibble(A = c(58, 64, 55, 66, 67),
B = c(58, 69, 71, 64, 68),
C = c(48, 57, 59, 47, 49))
aa <- xxChemitech
# #Summary
aa %>%
pivot_longer(everything(), names_to = "key", values_to = "value") %>%
group_by(key) %>%
summarise(across(value,
list(Count = length, Mean = mean, SD = sd, Variance = var),
.names = "{.fn}"))
## # A tibble: 3 x 5
## key Count Mean SD Variance
## <chr> <int> <dbl> <dbl> <dbl>
## 1 A 5 62 5.24 27.5
## 2 B 5 66 5.15 26.5
## 3 C 5 52 5.57 31
21.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)
If \({H_0}\) is rejected, we cannot conclude that all population means are different. Rejecting \({H_0}\) means that at least two population means have different values.
Three assumptions are required to use analysis of variance.
If the sample sizes are equal, analysis of variance is not sensitive to departures from the assumption of normally distributed populations.
Normality:
If the means for the three populations are equal, we would expect the three sample means to be close together. The more the sample means differ, the stronger the evidence we have for the conclusion that the population means differ. In other words, if the variability among the sample means is “small,” it supports \({H_0}\); if the variability among the sample means is “large,” it supports \({H_a}\).
If the null hypothesis is true, we can use the variability among the sample means to develop an estimate of \({\sigma}^2\).
First, note that if the assumptions for analysis of variance are satisfied and the null hypothesis is true, each sample will have come from the same normal distribution with mean \({\mu}\) and variance \({\sigma}^2\).
Recall that the sampling distribution of the sample mean \({\overline{x}}\) for a simple random sample of size \({n}\) from a normal population will be normally distributed with mean \({\mu}\) and variance \({\sigma}_{{\overline{x}}}^2 = \frac{{\sigma}^2}{n}\).
In this case, the mean and variance of the three sample mean values \(\{{\overline{x}}_1 = 62, {\overline{x}}_2 = 66, {\overline{x}}_3 = 52\}\) can be used to estimate the mean and variance of the sampling distribution.
When the sample sizes are equal, as in this example, the best estimate of the mean of the sampling distribution of \({\overline{x}}\) is the mean or average of the sample means.
In this example, an estimate of the mean of the sampling distribution of \({\overline{x}}\) is \((62 + 66 + 52)/3 = 60\). We refer to this estimate as the overall sample mean. Refer equation (21.5)
An estimate of the variance of the sampling distribution of \({\overline{x}}\), \({\sigma}_{{\overline{x}}}^2\), is provided by the variance of the three sample means.
\[{s}_{\overline{x}}^2 = \frac{(62 - 60)^2 + (66 - 60)^2 + (52 - 60)^2}{3 - 1} = 52\] Because \({\sigma}^2 = n {\sigma}_{{\overline{x}}}^2\), the estimate can be given by
\[E_{{\sigma}^2} = n E_{{\sigma}_{{\overline{x}}}^2} = n {s}_{\overline{x}}^2 = 5 \times 52 = 260\]
The \(n {s}_{\overline{x}}^2\) is referred as between-treatments estimate of \({\sigma}^2\). It is based on the assumption that the null hypothesis is true. In this case, each sample comes from the same population, and there is only one sampling distribution of \({\overline{x}}\).
In contrast, when the population means are not equal, the between-treatments estimate will overestimate the population variance \({\sigma}^2\).
The variation within each of the samples also has an effect on the conclusion we reach in analysis of variance. When a simple random sample is selected from each population, each of the sample variances provides an unbiased estimate of \({\sigma}^2\). Hence, we can combine or pool the individual estimates of \({\sigma}^2\) into one overall estimate. The estimate of \({\sigma}^2\) obtained in this way is called the pooled or within-treatments estimate of \({\sigma}^2\).
Because each sample variance provides an estimate of \({\sigma}^2\) based only on the variation within each sample, the within-treatments estimate of \({\sigma}^2\) is not affected by whether the population means are equal. When the sample sizes are equal, the within-treatments estimate of \({\sigma}^2\) can be obtained by computing the average of the individual sample variances \(\{27.5, 26.5, 31\}\).
For this example we obtain:
\[\text{Within-treatments estimate of } {\sigma}^2 = \frac{27.5 + 26.5 + 31}{3} = 28.33\] Remember, that the between-treatments approach provides a good estimate of \({\sigma}^2\) only if the null hypothesis is true; if the null hypothesis is false, the between-treatments approach overestimates \({\sigma}^2\). The within-treatments approach provides a good estimate of \({\sigma}^2\) in either case.
Thus, if the null hypothesis is true, the two estimates will be similar and their ratio will be close to 1.
If the null hypothesis is false, the between-treatments estimate will be larger than the within-treatments estimate, and their ratio will be large.
Analysis of Variance and the Completely Randomized Design
where \({\mu}_j\) is mean of the \(j^{\text{th}}\) population. We assume that a simple random sample of size \({n}_j\) has been selected from each of the \({k}\) populations or treatments. For the resulting sample data, let
\[\begin{align} {x}_{ij} &= \text{value of observation } i \text{ for treatment } j \\ {n}_{j} &= \text{number of observations for treatment } j \\ {\overline{x}}_{j} &= \text{sample mean for treatment } j \\ {s}_{j}^2 &= \text{sample variance for treatment } j \\ {s}_{j} &= \text{sample standard deviation for treatment } j \end{align}\]
The formulas for the sample mean and sample variance for treatment \({j}\) are given in equations (21.1) and (21.2)
\[{\overline{x}}_j = \frac{\sum_{i=1}^{n_j}{x}_{ij}}{{n}_j} \tag{21.1}\]
\[{s}_j^2 = \frac{\sum_{i=1}^{n_j}{\left({x}_{ij} - {\overline{x}}_j\right)^2}}{{n}_j - 1} \tag{21.2}\]
The overall sample mean, denoted \({\overline{\overline{x}}}\), is the sum of all the observations divided by the total number of observations.
\[{\bar{\bar{x}}} = \frac{\sum_{j=1}^k{\sum_{i=1}^{{n}_j}{{x}_{ij}}}}{{n}_T} \tag{21.3}\]
Where
\[{n}_T = {n}_1 + {n}_2 + \cdots + {n}_k \tag{21.4}\]
If the size of each sample is \({n}\), the equation (21.4) becomes \({n}_T = kn\), and the equation (21.3) reduces to (21.5)
\[{\bar{\bar{x}}} = \frac{\sum_{j=1}^k{{\overline{x}}_{j}}}{k} \tag{21.5}\]
Thus, whenever the sample sizes are the same, the overall sample mean is just the average of the \({k}\) sample means.
Thus, in the example, from (21.5), \({\bar{\bar{x}}} = \frac{62 + 66 + 52}{3} = 60\)
Between-Treatments Estimate of Population Variance
The between-treatments estimate of \({\sigma}^2\) is called the mean square due to treatments and is denoted \(\text{MSTR}\). Refer equation (21.6)
\[\text{MSTR} = \frac{\text{SSTR}}{k - 1} = \frac{\sum_{j=1}^{k}{n}_j\left({\overline{x}}_j - {\bar{\bar{x}}} \right)^2}{k - 1} \tag{21.6}\]
The numerator in equation (21.6) is called the sum of squares due to treatments and is denoted \(\text{SSTR}\). The denominator, \({k − 1}\), represents the degrees of freedom associated with SSTR. Refer equation (21.7)
\[\text{SSTR} = \sum_{j=1}^{k}{n}_j\left({\overline{x}}_j - {\bar{\bar{x}}} \right)^2 \tag{21.7}\]
If \({H_0}\) is true, MSTR provides an unbiased estimate of \({\sigma}^2\). However, if the means of the \({k}\) populations are not equal, MSTR is not an unbiased estimate of \({\sigma}^2\); in fact, in that case, MSTR should overestimate \({\sigma}^2\).
In the example:
Within-Treatments Estimate of Population Variance
The within-treatments estimate of \({\sigma}^2\) is called the mean square due to error and is denoted \(\text{MSE}\). Refer equation (21.8)
\[\text{MSE} = \frac{\text{SSE}}{{n}_T - k} = \frac{\sum_{j=1}^{k}{({n}_j - 1){s}_j^2}}{{n}_T - k} \tag{21.8}\]
The numerator in equation (21.8) is called the sum of squares due to error and is denoted \(\text{SSE}\). The denominator, \({{n}_T - k}\) is referred to as the degrees of freedom associated with SSE. Refer equation (21.9)
\[\text{SSE} = \sum_{j=1}^{k}{({n}_j - 1){s}_j^2} \tag{21.9}\]
In the example:
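Using the sample means (62, 66, 52) and variances (27.5, 26.5, 31) computed earlier, the full calculation through the F statistic can be sketched; the results match the aov() output that follows:

```r
# #Equations (21.3) and (21.6)-(21.10) from the Chemitech summary statistics
xbar_j <- c(A = 62, B = 66, C = 52)     # sample means
s2_j   <- c(A = 27.5, B = 26.5, C = 31) # sample variances
n_j    <- c(5, 5, 5)                    # sample sizes
k  <- length(n_j)
nT <- sum(n_j)

xbarbar <- sum(n_j * xbar_j) / nT       # (21.3) overall sample mean = 60
SSTR <- sum(n_j * (xbar_j - xbarbar)^2) # (21.7) sum of squares due to treatments
MSTR <- SSTR / (k - 1)                  # (21.6) between-treatments estimate
SSE  <- sum((n_j - 1) * s2_j)           # (21.9) sum of squares due to error
MSE  <- SSE / (nT - k)                  # (21.8) within-treatments estimate
Fstat <- MSTR / MSE                     # (21.10) test statistic
pval  <- pf(q = Fstat, df1 = k - 1, df2 = nT - k, lower.tail = FALSE)
round(c(SSTR = SSTR, MSTR = MSTR, SSE = SSE, MSE = MSE, F = Fstat, p = pval), 4)
```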
Comparing the Variance Estimates
If the null hypothesis is true, \(\text{MSTR}\) and \(\text{MSE}\) provide two independent, unbiased estimates of \({\sigma}^2\).
Refer Chapter: Variance.
We know that for normal populations, the sampling distribution of the ratio of two independent estimates of \({\sigma}^2\) follows an F distribution. Hence, if the null hypothesis is true and the ANOVA assumptions are valid, the sampling distribution of \(\frac{\text{MSTR}}{\text{MSE}}\) is an F distribution with numerator degrees of freedom equal to \({k - 1}\) and denominator degrees of freedom equal to \({{n}_T - k}\).
In other words, if the null hypothesis is true, the value of MSTR/MSE should appear to have been selected from this F distribution. However, if the null hypothesis is false, the value of \(\frac{\text{MSTR}}{\text{MSE}}\) will be inflated because MSTR overestimates \({\sigma}^2\). Hence, we will reject \({H_0}\) if the resulting value of \(\frac{\text{MSTR}}{\text{MSE}}\) appears to be too large to have been selected from an F distribution with \({k - 1}\) numerator degrees of freedom and \({{n}_T - k}\) denominator degrees of freedom.
Because the decision to reject \({H_0}\) is based on the value of \(\frac{\text{MSTR}}{\text{MSE}}\), the test statistic used to test for the equality of \({k}\) population means is given by equation (21.10)
Test Statistic for the Equality of \({k}\) Population Means :
\[F = \frac{\text{MSTR}}{\text{MSE}} \tag{21.10}\]
Because we will only reject the null hypothesis for large values of the test statistic, the p-value is the upper tail area of the F distribution to the right of the test statistic \({F}\).
pf(q = 9.18, df1 = 2, df2 = 12, lower.tail = FALSE)
The sum of squares associated with the source of variation referred to as “Total” is called the total sum of squares (SST). SST divided by its degrees of freedom \({n}_T - 1\) is nothing more than the overall sample variance that would be obtained by treating the entire set of 15 observations as one data set. Refer equation (21.11)
\[\text{SST} = \text{SSTR} + \text{SSE} = \sum_{j=1}^k{\sum_{i=1}^{{n}_j}{\left( {x}_{ij} - \bar{\bar{x}}\right)^2}} \tag{21.11}\]
The degrees of freedom associated with this total sum of squares is the sum of the degrees of freedom associated with the sum of squares due to treatments and the sum of squares due to error i.e. \({n}_T - 1 = (k - 1) + ({n}_T - k)\).
In other words, SST can be partitioned into two sums of squares: the sum of squares due to treatments and the sum of squares due to error. Note also that the degrees of freedom corresponding to SST, \({n}_T - 1\), can be partitioned into the degrees of freedom corresponding to SSTR, \(k - 1\), and the degrees of freedom corresponding to SSE, \({n}_T - k\).
The analysis of variance can be viewed as the process of partitioning the total sum of squares and the degrees of freedom into their corresponding sources: treatments and error. Dividing the sum of squares by the appropriate degrees of freedom provides the variance estimates, the F value, and the p-value used to test the hypothesis of equal population means.
The square root of MSE provides the best estimate of the population standard deviation \({\sigma}\). On the computer output this estimate of \({\sigma}\) is labelled Pooled StDev.
str(aa)
## tibble [5 x 3] (S3: tbl_df/tbl/data.frame)
## $ A: num [1:5] 58 64 55 66 67
## $ B: num [1:5] 58 69 71 64 68
## $ C: num [1:5] 48 57 59 47 49
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
#
# #ANOVA
ii_aov <- aov(formula = value ~ key, data = bb)
#names(ii_aov)
#ii_aov
#
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##
## 60
##
## key
## key
## A B C
## 62 66 52
#
# #Summary
summary(ii_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## key 2 520 260.00 9.176 0.00382 **
## Residuals 12 340 28.33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of variance (ANOVA) can be used to test for differences among means of several populations or treatments.
The completely randomized design and the randomized block design are used to draw conclusions about differences in the means of a single factor. The primary purpose of blocking in the randomized block design is to remove extraneous sources of variation from the error term. Such blocking provides a better estimate of the true error variance and a better test to determine whether the population or treatment means of the factor differ significantly.
The basis for the statistical tests used in analysis of variance and experimental design is the development of two independent estimates of the population variance \({\sigma}^2\). In the single-factor case, one estimator is based on the variation between the treatments; this estimator provides an unbiased estimate of \({\sigma}^2\) only if the means \(\{{\mu}_1, {\mu}_2, \ldots, {\mu}_k\}\) are all equal. A second estimator of \({\sigma}^2\) is based on the variation of the observations within each sample; this estimator will always provide an unbiased estimate of \({\sigma}^2\).
By computing the ratio of these two estimators (the F statistic), it is determined whether to reject the null hypothesis that the population or treatment means are equal.
In all the experimental designs considered, the partitioning of the sum of squares and degrees of freedom into their various sources enabled us to compute the appropriate values for the analysis of variance calculations and tests.
Further, Fisher LSD procedure and the Bonferroni adjustment can be used to perform pairwise comparisons to determine which means are different.
\[{y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon} \tag{22.1}\]
Note
Multiple regression analysis enables us to understand how a dependent variable is related to two or more independent variables.
Parametric methods mostly require quantitative data. However, when their assumptions are met, they are generally more powerful than nonparametric methods.
Most of the statistical methods referred to as parametric methods require quantitative data, while nonparametric methods allow inferences based on either categorical or quantitative data.
1.1: Vectors
Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.
1.2: Components
Members of a vector are called components.
1.3: Packages
Packages are the fundamental units of reproducible R code.
2.1: R-Markdown
R Markdown is a file format for making dynamic documents with R.
2.2: NA
NA is a logical constant of length 1 which contains a missing value indicator.
2.3: Factors
Factors are the data objects which are used to categorize the data and store it as levels.
2.4: Lists
Lists are by far the most flexible data structure in R. They can be seen as a collection of elements without any restriction on the class, length or structure of each element.
2.5: DataFrame
Data Frames are lists with restriction that all elements of a data frame are of equal length.
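A minimal sketch tying these structures together:

```r
# #Vectors, factors, lists, and data frames in one place
v <- c(2, 4, 6)                      # vector: components share one basic type
f <- factor(c("low", "high", "low")) # factor: categorical data stored as levels
l <- list(nums = v, flag = NA)       # list: elements of any class or length
d <- data.frame(x = v, y = f)        # data frame: a list of equal-length columns
length(v)
levels(f)
nrow(d)
```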
7.1: H-Variances
\(\text{\{Variances\}} {H_0} : {\sigma}_1 = {\sigma}_2 = \dots = {\sigma}_k \iff {H_a}: \text{At least two variances differ.}\)
8.1: Imputation
Imputation is the process of replacing missing data with substituted values. Imputation preserves all cases by replacing missing data with an estimated value based on other available information.
9.1: Data
Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.
9.2: Elements
Elements are the entities on which data are collected. (Generally ROWS)
9.3: Variable
A variable is a characteristic of interest for the elements. (Generally COLUMNS)
9.4: Observation
The set of measurements obtained for a particular element is called an observation.
9.5: Statistics
Statistics is the art and science of collecting, analysing, presenting, and interpreting data.
9.6: Scale-of-Measurement
The scale of measurement determines the amount of information contained in the data and indicates the most appropriate data summarization and statistical analyses.
9.7: Nominal-Scale
When the data for a variable consist of labels or names used to identify an attribute of the element, the scale of measurement is considered a nominal scale.
9.8: Ordinal-Scale
The scale of measurement for a variable is considered an ordinal scale if the data exhibit the properties of nominal data and in addition, the order or rank of the data is meaningful.
9.9: Interval-Scale
The scale of measurement for a variable is an interval scale if the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure.
9.10: Ratio-Scale
The scale of measurement for a variable is a ratio scale if the data have all the properties of interval data and the ratio of two values is meaningful.
9.11: Categorical-Data
Data that can be grouped by specific categories are referred to as categorical data. Categorical data use either the nominal or ordinal scale of measurement.
9.12: Quantitative-Data
Data that use numeric values to indicate ‘how much’ or ‘how many’ are referred to as quantitative data. Quantitative data are obtained using either the interval or ratio scale of measurement.
9.13: Discrete
Quantitative data that measure ‘how many’ are discrete.
9.14: Continuous
Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.
9.15: Cross-Sectional-Data
Cross-sectional data are data collected at the same or approximately the same point in time.
9.16: Time-Series-Data
Time-series data are data collected over several time periods.
9.17: Observational-Study
In an observational study we simply observe what is happening in a particular situation, record data on one or more variables of interest, and conduct a statistical analysis of the resulting data.
9.18: Experiment
The key difference between an observational study and an experiment is that an experiment is conducted under controlled conditions.
9.19: Descriptive-Statistics
Most of the statistical information is summarized and presented in a form that is easy to understand. Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.
9.20: Population
A population is the set of all elements of interest in a particular study.
9.21: Sample
A sample is a subset of the population.
9.22: Parameter-vs-Statistic
The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.
9.23: Census
The process of conducting a survey to collect data for the entire population is called a census.
9.24: Sample-Survey
The process of conducting a survey to collect data for a sample is called a sample survey.
9.25: Statistical-Inference
Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.
9.26: Analytics
Analytics is the scientific process of transforming data into insight for making better decisions.
9.27: Descriptive-Analytics
Descriptive analytics encompasses the set of analytical techniques that describe what has happened in the past.
9.28: Predictive-Analytics
Predictive analytics consists of analytical techniques that use models constructed from past data to predict the future or to assess the impact of one variable on another.
9.29: Prescriptive-Analytics
Prescriptive analytics is the set of analytical techniques that yield a best course of action.
9.30: Big-Data
Larger and more complex data sets are now often referred to as big data.
9.31: Data-Mining
Data Mining deals with methods for developing useful decision-making information from large databases. It can be defined as the automated extraction of predictive information from (large) databases.
10.1: Frequency-Distribution
A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.
10.2: Cross-Tab
A crosstabulation is a tabular summary of data for two variables. It is used to investigate the relationship between them. Generally, at least one of the variables is categorical.
11.1: Number
A number is a mathematical object used to count, measure, and label. The study and use of numbers is called arithmetic, a term which may also refer to number theory, the study of the properties of numbers.
11.2: Prime
A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers. A natural number greater than 1 that is not prime is called a ‘composite number.’ 1 is neither a Prime nor a composite, it is a ‘Unit.’ Thus, by definition, Negative Integers and Zero cannot be Prime.
11.3: Parity-Odd-Even
Parity is the property of an integer \(\mathbb{Z}\) of being even or odd. An integer is even if it is divisible by 2 with no remainder, and odd otherwise. Thus, -2, 0, +2 are even but -1, 1 are odd. Numbers ending with 0, 2, 4, 6, 8 are even. Numbers ending with 1, 3, 5, 7, 9 are odd.
11.4: Positive-Negative
An integer \(\mathbb{Z}\) is positive if it is greater than zero, and negative if it is less than zero. Zero is defined as neither negative nor positive.
11.5: Mersenne-Primes
Mersenne primes are those prime numbers that are of the form \((2^n -1)\); that is, \(\{3, 7, 31, 127, \ldots \}\)
11.6: Mean
Given a data set \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({{x}_1, {x}_2, \ldots, {x}_n}\) divided by the count \({n}\).
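A minimal base-R sketch with a hypothetical vector `x`; `mean()` implements exactly this definition:

```r
x <- c(4, 8, 15, 16, 23, 42)   # hypothetical sample
mean(x)                        # 18
sum(x) / length(x)             # same value, by definition
```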
11.7: Median
Median of a population is any value such that at most half of the population is less than the proposed median and at most half is greater than the proposed median.
11.8: Geometric-Mean
The geometric mean \(\overline{x}_g\) is a measure of location that is calculated by finding the \(n^{th}\) root of the product of \({n}\) values.
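A base-R sketch with hypothetical growth factors; the log-mean form is the numerically safer equivalent of the \(n^{th}\) root of the product:

```r
x <- c(1.10, 1.20, 0.90)   # hypothetical growth factors
prod(x)^(1 / length(x))    # nth root of the product, ≈ 1.0591
exp(mean(log(x)))          # equivalent, numerically safer form
```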
11.9: Mode
The mode is the value that occurs with greatest frequency.
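Base R's `mode()` returns the storage mode, not the statistical mode, so a small helper is needed. A sketch for numeric data (`stat_mode` is a hypothetical name):

```r
# Statistical mode for a numeric vector; returns all tied modes
stat_mode <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[tab == max(tab)])
}
stat_mode(c(2, 3, 3, 5, 7, 3, 5))   # 3
```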
11.10: Percentile
A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. For a data set containing \({n}\) observations, the \(p^{th}\) percentile divides the data into two parts: approximately p% of the observations are less than the \(p^{th}\) percentile, and approximately (100 – p)% of the observations are greater than the \(p^{th}\) percentile.
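In R, percentiles are computed with `quantile()`; note it interpolates between observations by default (`type = 7`):

```r
x <- 1:20
quantile(x, probs = c(0.25, 0.50, 0.80))   # 5.75, 10.50, 16.20
```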
11.11: Variance
The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean \({\overline{x}}\). The average of the squared deviations is called the variance.
11.12: Standard-Deviation
The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.
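A base-R sketch on a made-up vector; note that `var()` and `sd()` compute the *sample* versions, dividing the squared deviations by \(n - 1\) rather than \(n\):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
var(x)                                   # sample variance, divisor n - 1
sd(x)                                    # square root of var(x)
sum((x - mean(x))^2) / (length(x) - 1)   # same as var(x)
```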
11.13: Skewness
Skewness \((\tilde{\mu}_{3})\) is a measure of the shape of a data distribution. It is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
11.14: Tails
A tail refers to the tapering sides at either end of a distribution curve.
11.15: Kurtosis
Kurtosis \((\tilde{\mu}_{4})\) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution. For \({\mathcal{N}}_{(\mu, \, \sigma)}\), kurtosis is 3 and excess kurtosis is 0 (i.e. subtract 3).
11.16: TheSample
A sample of \({n}\) observations given by \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\) have a sample mean \({\overline{x}}\) and the sample standard deviation, \({s}\).
11.17: z-Scores
A z-score \({z_i}\) is computed for each \({x_i}\) and can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). The z-score is often called the standardized value or standard score.
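A base-R sketch on a hypothetical vector; `scale()` performs the same centring and scaling as the explicit formula:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
z <- (x - mean(x)) / sd(x)   # z-score of each observation
as.numeric(scale(x))         # scale() gives the same standardized values
```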
11.18: t-statistic
Computing a z-score requires knowing the mean \({\mu}\) and standard deviation \({\sigma}\) of the complete population to which a data point belongs. If one only has a sample of observations from the population, then the analogous computation with sample mean \({\overline{x}}\) and sample standard deviation \({s}\) yields the t-statistic.
11.19: Chebyshev-Theorem
Chebyshev Theorem can be used to make statements about the proportion of data values that must be within a specified number of standard deviations \({\sigma}\), of the mean \({\mu}\).
11.20: Empirical-Rule
The empirical rule gives the percentage of data values that lie within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution: approximately 68.27%, 95.45%, and 99.73%, respectively.
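The three percentages fall straight out of the standard normal CDF `pnorm()`:

```r
pnorm(1) - pnorm(-1)   # ≈ 0.6827
pnorm(2) - pnorm(-2)   # ≈ 0.9545
pnorm(3) - pnorm(-3)   # ≈ 0.9973
```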
11.21: Outliers
Unusually large or unusually small values are called outliers. An outlier is a data point that differs significantly from the other observations.
11.22: Covariance
Covariance is a measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
11.23: Correlation-Coefficient
Correlation coefficient is a measure of linear association between two variables that takes on values between -1 and +1. Values near +1 indicate a strong positive linear relationship; values near -1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.
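A base-R sketch with hypothetical paired data; `cov()` and `cor()` are the sample covariance and Pearson correlation:

```r
x <- c(1, 2, 3, 4, 5)   # hypothetical paired data
y <- c(2, 4, 5, 4, 6)
cov(x, y)   # 2: a positive linear association
cor(x, y)   # ≈ 0.853: a fairly strong positive linear relationship
```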
12.1: Probability
Probability is a numerical measure of the likelihood that an event will occur. Probability values are always assigned on a scale from 0 to 1. A probability near zero indicates an event is unlikely to occur; a probability near 1 indicates an event is almost certain to occur.
12.2: Random-Experiment
A random experiment is a process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined completely by chance.
12.3: Sample-Space
The sample space for a random experiment is the set of all experimental outcomes.
12.4: Counting-Rule
Counting Rule for Multiple-Step Experiments: If an experiment can be described as a sequence of \({k}\) steps with \({n_1}\) possible outcomes on the first step, \({n_2}\) possible outcomes on the second step, and so on, then the total number of experimental outcomes is given by \(\{(n_1)(n_2) \cdots (n_k) \}\)
12.5: Tree-Diagram
A tree diagram is a graphical representation that helps in visualizing a multiple-step experiment.
12.6: Factorial
The factorial of a non-negative integer \({n}\), denoted by \(n!\), is the product of all positive integers less than or equal to \({n}\). The value of 0! is 1 i.e. \(0!=1\)
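R provides this directly as `factorial()`:

```r
factorial(5)   # 5*4*3*2*1 = 120
factorial(0)   # 1, by definition
```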
12.7: Combinations
Combination allows one to count the number of experimental outcomes when the experiment involves selecting \({k}\) objects from a set of \({N}\) objects. The number of combinations of \({N}\) objects taken \({k}\) at a time is equal to the binomial coefficient \(C_k^N\)
12.8: Permutations
Permutation allows one to compute the number of experimental outcomes when \({k}\) objects are to be selected from a set of \({N}\) objects where the order of selection is important. The same \({k}\) objects selected in a different order are considered a different experimental outcome. The number of permutations of \({N}\) objects taken \({k}\) at a time is given by \(P_k^N\)
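Both counts are available in base R: `choose()` gives the binomial coefficient \(C_k^N\), and multiplying by \(k!\) gives \(P_k^N\):

```r
choose(5, 2)                  # C(5,2) = 10 unordered selections
choose(5, 2) * factorial(2)   # P(5,2) = 20 ordered selections
```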
12.9: Event
An event is a collection of sample points. The probability of any event is equal to the sum of the probabilities of the sample points in the event. The sample space, \({S}\), is an event. Because it contains all the experimental outcomes, it has a probability of 1; that is, \(P(S) = 1\)
12.10: Complement
Given an event \({A}\), the complement of A (\(A^c\)) is defined to be the event consisting of all sample points that are not in A. Thus, \(P(A) + P(A^{c}) =1\)
12.11: Union
Given two events A and B, the union of A and B is the event containing all sample points belonging to A or B or both. The union is denoted by \(A \cup B\)
12.12: Intersection
Given two events A and B, the intersection of A and B is the event containing the sample points belonging to both A and B. The intersection is denoted by \(A \cap B\)
12.13: Mutually-Exclusive
Two events are said to be mutually exclusive if the events have no sample points in common. Thus, \(A \cap B = \varnothing\) and hence \(P(A \cap B) = 0\)
12.14: Conditional-Probability
Conditional probability is the probability of an event given that another event has already occurred. The conditional probability of ‘A given B’ is \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
12.15: Events-Independent
Two events A and B are independent if \(P(A|B) = P(A) \quad \text{OR} \quad P(B|A) = P(B) \Rightarrow P(A \cap B) = P(A) \cdot P(B)\)
13.1: Random-Variable
A random variable is a numerical description of the outcome of an experiment. Random variables must assume numerical values. It can be either ‘discrete’ or ‘continuous.’
13.2: Discrete-Random-Variable
A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. Coded categorical (factor) variables, e.g. Male as 0 and Female as 1, are also discrete random variables.
13.3: Continuous-Random-Variable
A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.
13.4: Probability-Distribution
The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
13.5: Probability-Function
For a discrete random variable x, a probability function \(f(x)\), provides the probability for each value of the random variable.
13.6: Expected-Value-Discrete
The expected value, or mean, of a random variable is a measure of the central location for the random variable. i.e. \(E(x) = \mu = \sum xf(x)\)
13.7: Variance-Discrete
The variance is a weighted average of the squared deviations of a random variable from its mean. The weights are the probabilities. i.e. \(\text{Var}(x) = \sigma^2 = \sum \{(x- \mu)^2 \cdot f(x)\}\)
13.8: Bivariate
A probability distribution involving two random variables is called a bivariate probability distribution. A discrete bivariate probability distribution provides a probability for each pair of values that may occur for the two random variables.
14.1: Uniform-Probability-Distribution
Uniform probability distribution is a continuous probability distribution for which the probability that the random variable will assume a value in any interval is the same for each interval of equal length. Whenever the probability is proportional to the length of the interval, the random variable is uniformly distributed.
14.2: Probability-Density-Function
The probability that the continuous random variable \({x}\) takes a value between \([a, b]\) is given by the area under the graph of probability density function \(f(x)\); that is, \(A = \int _{a}^{b}f(x)\ dx\). Note that \(f(x)\) can be greater than 1, however its integral must be equal to 1.
14.3: Normal-Distribution
A normal distribution (\({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\)) is a type of continuous probability distribution for a real-valued random variable.
14.4: Standard-Normal
A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\)
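The standard normal distribution is the default for R's `dnorm()`, `pnorm()`, and `qnorm()` (`mean = 0`, `sd = 1`):

```r
pnorm(0)       # P(Z <= 0) = 0.5
qnorm(0.975)   # ≈ 1.96, the usual 95% two-sided critical value
dnorm(0)       # peak density, 1/sqrt(2*pi) ≈ 0.3989
```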
15.1: Sampled-Population
The sampled population is the population from which the sample is drawn.
15.2: Frame
Frame is a list of the elements that the sample will be selected from.
15.3: Target-Population
The target population is the population we want to make inferences about. Generally (and preferably), it will be the same as the ‘Sampled-Population,’ but it may also differ.
15.4: SRS
A simple random sample (SRS) is a set of \({k}\) objects drawn from a population of \({N}\) objects such that all possible samples are equally likely to be selected. The number of such different simple random samples is \(C_k^N\)
15.5: Sampling-without-Replacement
Sampling without replacement: Once an element has been included in the sample, it is removed from the population and cannot be selected a second time.
15.6: Sampling-with-Replacement
Sampling with replacement: Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.
15.7: Random-Sample
A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.
15.8: Proportion
A population proportion \({P}\) is a parameter that describes a percentage value associated with a population. It is given by \(P = \frac{X}{N}\), where \({X}\) is the count of successes in the population and \({N}\) is the size of the population. It is estimated through the sample proportion \(\overline{p} = \frac{x}{n}\), where \({x}\) is the count of successes in the sample and \({n}\) is the size of the sample obtained from the population.
15.9: Point-Estimation
To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.
15.10: Point-Estimator
A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), \(\rho_{xy}\) (correlation)
15.11: Point-Estimate
The numerical value obtained for the sample statistic is called the point estimate. Estimate is used for sample value only, for population value it would be parameter. Estimate is a value while Estimator is a function.
15.12: Sampling-Distribution
The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).
15.13: Standard-Error
In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’
15.14: Sampling-Error
A sampling error is the difference between a population parameter and a sample statistic.
15.15: Central-Limit-Theorem
Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.
16.1: Interval-Estimate
Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)
16.2: Confidence-Interval
Confidence interval is another name for an interval estimate. Normally it is given as \(({\gamma} = 1 - {\alpha})\). Ex: 95% confidence interval
16.3: Confidence-Coefficient
The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.
16.4: t-distribution
When \({s}\) is used to estimate \({\sigma}\), the margin of error and the interval estimate for the population mean are based on a probability distribution known as the t distribution.
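A sketch with hypothetical measurements: `t.test()` returns the t-based interval estimate for \({\mu}\) when \({\sigma}\) is unknown:

```r
x <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3)   # hypothetical measurements
t.test(x, conf.level = 0.95)$conf.int        # t-based interval for mu
```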
16.5: Degrees-of-Freedom
The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).
17.1: Hypothesis-Testing
Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.
17.2: Hypothesis-Null
Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.
17.3: Hypothesis-Alternative
Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.
17.4: Hypothesis-1T-Lower-Tail
\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)
17.5: Hypothesis-1T-Upper-Tail
\(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
17.6: Hypothesis-2T-Two-Tail
\(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)
17.7: Error-Type-I
The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).
17.8: Error-Type-II
The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).
17.9: Level-of-Significance
The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.
17.10: Significance-Tests
Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.
17.11: Test-Statistic
Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.
17.12: Tailed-Test
A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.
17.13: One-Tailed-Test
One-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.
17.14: 1s-known-sd
If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (17.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)
17.15: Approach-p-value
The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.
17.16: p-value
A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).
17.17: Approach-Critical-Value
The critical value approach requires that we first determine a value for the test statistic called the critical value.
17.18: Critical-Value
Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).
17.19: Acceptance-Region
An acceptance region (confidence interval) is a set of values for the test statistic for which the null hypothesis is accepted. i.e. if the observed test statistic is in the confidence interval then we accept the null hypothesis and reject the alternative hypothesis.
17.20: Margin-Error
The margin of error tells how far the population mean might be from the sample mean. It is given by \(Z\frac{{\sigma}}{\sqrt{n}}\)
17.21: Rejection-Region
A rejection region (critical region), is a set of values for the test statistic for which the null hypothesis is rejected. i.e. if the observed test statistic is in the critical region then we reject the null hypothesis and accept the alternative hypothesis.
17.22: Two-Tailed-Test
Two-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.
17.23: Approach-p-value-Steps
p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret
17.24: 1s-unknown-sd
If \({\sigma}\) is unknown, the sampling distribution of the test statistic follows the t distribution with \((n - 1)\) degrees of freedom. Refer equation (17.3) \(\to t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}}\)
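A one-sample sketch with hypothetical data (testing \(H_0: \mu = 12\)); `t.test()` computes this t statistic and its p-value:

```r
x <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3)      # hypothetical sample
t.test(x, mu = 12, alternative = "two.sided")   # H0: mu = 12
```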
17.25: H-1s-p-Lower
\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p} \geq {p}_0 \iff {H_a}: {p} < {p}_0\)
17.26: H-1s-p-Upper
\(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)
17.27: H-1s-p-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {p} = {p}_0 \iff {H_a}: {p} \neq {p}_0\)
17.28: Power
The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
17.29: Power-Curve
Power Curve is a graph of the probability of rejecting \({H_0}\) for all possible values of the population parameter \({\mu}\) not satisfying the null hypothesis. It provides the probability of correctly rejecting the null hypothesis.
18.1: Independent-Simple-Random-Samples
Let \({\mathcal{N}}_{({\mu}_1, \, {\sigma}_1)}\) and \({\mathcal{N}}_{({\mu}_2, \, {\sigma}_2)}\) be the two populations. To make an inference about the difference between the means \(({\mu}_1 - {\mu}_2)\), we select a simple random sample of \({n}_1\) units from population 1 and a second simple random sample of \({n}_2\) units from population 2. The two samples, taken separately and independently, are referred to as independent simple random samples.
18.2: H-2s-Lower
\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
18.3: H-2s-Upper
\(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
18.4: H-2s-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
18.5: Shapiro-Wilk-Test
The Shapiro-Wilk test is a test of normality. It tests the null hypothesis that a sample came from a normally distributed population. (p-value should be greater than \({\alpha}\) for normality)
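A sketch on simulated data: for draws from a normal population, the test's p-value is typically above \({\alpha}\), so normality is not rejected:

```r
set.seed(7)                # for reproducibility of the simulated sample
shapiro.test(rnorm(50))    # normal data: p-value typically > alpha
```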
18.6: Independent-Sample-Design-Example
Independent sample design: A simple random sample of workers is selected and each worker in the sample uses method 1. A second independent simple random sample of workers is selected and each worker in this sample uses method 2.
18.7: Matched-Sample-Design-Example
Matched sample design: One simple random sample of workers is selected. Each worker first uses one method and then uses the other method. The order of the two methods is assigned randomly to the workers, with some workers performing method 1 first and others performing method 2 first. Each worker provides a pair of data values, one value for method 1 and another value for method 2.
18.8: Hypo-Paired-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_d = 0 \iff {H_a}: {\mu}_d \neq 0\)
18.9: H-2s-p-Lower
\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p}_1 - {p}_2 \geq 0 \iff {H_a}: {p}_1 - {p}_2 < 0\)
18.10: H-2s-p-Upper
\(\text{\{Right or Upper\} } {H_0} : {p}_1 - {p}_2 \leq 0 \iff {H_a}: {p}_1 - {p}_2 > 0\)
18.11: H-2s-p-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {p}_1 - {p}_2 = 0 \iff {H_a}: {p}_1 - {p}_2 \neq 0\)
19.1: Distribution-Chi-Square
Whenever a simple random sample of size \({n}\) is selected from a normal population, the sampling distribution of \(\frac{(n-1)s^2}{{\sigma}^2}\) is a chi-square distribution with \({n - 1}\) degrees of freedom.
19.2: H-1s-Var-Lower
\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\sigma}^2 \geq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 < {{\sigma}_0^2}\)
19.3: H-1s-Var-Upper
\(\text{\{Right or Upper\} } {H_0} : {\sigma}^2 \leq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 > {{\sigma}_0^2}\)
19.4: H-1s-Var-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}^2 = {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 \neq {{\sigma}_0^2}\)
19.5: Distribution-F
Whenever independent simple random samples of sizes \({n}_1\) and \({n}_2\) are selected from two normal populations with equal variances \(({\sigma}_1^2 = {\sigma}_2^2)\), the sampling distribution of \(\frac{{s}_1^2}{{s}_2^2}\) is an F distribution with \(({n}_1 - 1)\) degrees of freedom for the numerator and \(({n}_2 - 1)\) degrees of freedom for the denominator.
19.6: H-2s-Var-Lower
\(\text{\{Left or Lower \} }\space\thinspace \text{Do not do this.}\)
19.7: H-2s-Var-Upper
\(\text{\{Right or Upper\} } {H_0} : {\sigma}_1^2 \leq {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 > {\sigma}_2^2\)
19.8: H-2s-Var-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}_1^2 = {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 \neq {\sigma}_2^2\)
20.1: H-3p
\(\text{\{Equality of Population Proportions \}} {H_0} : {p}_1 = {p}_2 = \dots = {p}_k \iff {H_a}: \text{Not all population proportions are equal}\)
21.1: Randomization
Randomization is the process of assigning the treatments to the experimental units at random.
21.2: H-ANOVA
\(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)
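A sketch using `PlantGrowth`, a data set shipped with base R (plant weight under three treatment groups); `aov()` fits the one-way ANOVA and the F test compares the group means:

```r
# PlantGrowth ships with base R: plant weight under three treatments
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)   # F test of H0: all group means are equal
```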
22.1: Regression-Analysis
Regression analysis can be used to develop an equation showing how two or more variables are related.
22.2: Variable-Dependent
The variable being predicted is called the dependent variable \(({y})\).
22.3: Variable-Independent
The variable or variables being used to predict the value of the dependent variable are called the independent variables \(({x})\).
22.4: Simple-Linear-Regression
The simplest type of regression analysis involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line, is called simple linear regression.
22.5: Regression-Model
The equation that describes how \({y}\) is related to \({x}\) and an error term is called the regression model. For example, simple linear regression model is given by equation (22.1)
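A simple linear regression sketch using `mtcars`, a data set shipped with base R; `lm()` estimates the intercept and slope by least squares:

```r
# mtcars ships with base R: fuel economy (mpg) versus weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)   # intercept ≈ 37.29, slope ≈ -5.34
```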
26.1: Parametric-Methods
Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).
26.2: Distribution-free-Methods
Distribution-free methods are the Statistical methods that make no assumption about the probability distribution of the population.
26.3: Nonparametric-Methods
Nonparametric methods are the statistical methods that require no assumption about the form of the probability distribution of the population and are often referred to as distribution free methods. Several of the methods can be applied with categorical as well as quantitative data.
1.1: cannot-open-connection
Error in file(file, ifelse(append, "a", "w")) : cannot open the connection
1.2: need-finite-xlim
Error in plot.window(...) : need finite 'xlim' values
1.3: par-old-par
Error in par(old.par) : invalid value specified for graphical parameter "pin"
2.1: plot-finite-xlim
Error in plot.window(...) : need finite 'xlim' values
2.2: Function-Not-Found
Error in arrange(bb, day) : could not find function "arrange"
3.1: Object-Not-Found-01
Error in match.arg(method) : object 'day' not found
3.2: Comparison-possible
Error in day == 1 : comparison (1) is possible only for atomic and list types
3.3: UseMethod-No-applicable-method
Error in UseMethod("select") : no applicable method for 'select' applied to an object of class "function"
3.4: Object-Not-Found-02
Error: Problem with `mutate()` column ... object 'arr_delay' not found
10.1: gg-stat-count-geom-bar
Error: stat_count() can only have an x or y aesthetic.
14.1: ggplot-list
Error in is.finite(x) : default method not implemented for type 'list'
14.2: ggplot-data
Error: Must subset the data pronoun with a string.
18.1: t-test-grouping
Error in t.test.formula() : grouping factor must have exactly 2 levels